From the 1970s onward, IT software projects acquired a notorious complexity due to the advancement of hardware (exponential rise in processing power of the personal computer) and software technologies (proliferation of programming languages and frameworks).
These advancements captivated users and radically changed their preferences, leaning ever more towards modernization and digitalization.
Today, and more than ever, digitalization has pervaded our lives, and software exists for almost everything.
Standalone solutions or products are now scarce, and the integration of mature technology and platforms in large IT systems allows businesses to offer rich, seamless, end-to-end services.
As project size spiralled upwards due to adding more platforms to the standard offering, careful project planning and solution design became critical to lowering project risk to a comfortable level so that it could be carried over during execution.
Agile design, with its tendency to bypass top-down, overarching plans in favour of small incremental modifications and faster releases, was much too dangerous as the cost of change in large projects soared. Much forethought had to be put into solution architecture to avoid costly rework.
In the first instalment of this series, we examined the core principles of solution design. This article will zoom in on the design of medium and large system integration projects, and part three will discuss Agile design and its applicability in specific contexts.
Note 1: In the remainder of this text, whenever we discuss solution design, it will be solely in the context of large integration projects. For core ideas and first principles, we recommend reading Part 1.
Note 2: If you also need some background in system integration or interface design, both integral to this discussion, we recommend checking the relevant articles.
3. Solution Design in Systems Integration
The story begins when the organization’s stakeholders acknowledge and decide to do something about a specific business need.
At first, these needs are broadly defined; for example, designing and implementing an algorithm to predict the demand for umbrellas in the rainy season.
Requirement definition results in a clear set of objectives, some of which are functional, while others can be non-functional or project constraints. Non-functional requirements are fundamental to large software systems and will be discussed later in detail.
The stage is now set, and architects can construct the high-level design. Potential solutions are explored at a conceptual level during this stage, and a solution candidate is selected.
After the high-level design is approved, tech leads and senior developers produce a low-level solution design in a series of iterations and revisions involving Business Analysts.
Low-level technical decisions about the product and its use cases are made, and parameters are tweaked to satisfy its usability constraints.
Using the umbrella demand forecast application as an example, we can think of the high-level design decisions as follows:
- The choice of the prediction engine (regression models, neural networks)
- The data processing pipeline (sanitization, generation of training sets)
- Model update and deployment
- Technology stack
- Training algorithms
- Non-functional requirements such as redundancy and high availability.
Low-level technical design decisions might involve:
- The User Interface (UI)
- The deployment model
- Backup and recovery
- Edge cases
- Non-functional requirements such as compliance and performance.
It is crucial to involve the customers in both design stages, and this involvement is especially vital in large projects where feedback loops can be very long.
It may be too late and expensive to correct design problems after they have been heavily integrated into the more comprehensive solution and their effects propagated and mitigated in downstream applications.
This process, involving one high-level and one low-level design phase with many iterations in between, is the generic approach for solution design in system integration projects.
3.2 Exploring Many Designs
Executives at Toyota made it a requirement to consider a broad range of alternatives during the decision-making process. They even encouraged their engineers to seek feedback from specialists in different areas or departments for maximum diversification.
Toyota understood that although a full exploration of the design space can be costly and will inevitably delay product launches, this cost can be justified when the stakes are very high.
Toyota executives also stressed the importance of efficient implementations to compensate for the time spent deliberating on design decisions.
In software, alternatives often exist, but deadline pressure may crowd out this exploration. Skipping it is a short-term tactical win that proves inefficient in the long term, as it helps accumulate architectural and technical debt.
3.3 Solution Design Optimization
One of the outcomes of a solution architecture is a vector of design and operation parameters that can be combined and varied in different permutations.
Each combination will produce a slightly different design that behaves similarly at its core level but with slight variations in more peripheral areas.
One example of a design parameter is the number of TCP links between two components. One link is easy to manage, while multiple links make session and communication management slightly more complicated but increase throughput.
An operational parameter can be the information displayed on the dashboard of an infrastructure monitoring application. The amount of data can be increased for better sense-making of problems, albeit at the cost of additional data storage and processing.
We can think of this tuning process of design parameters as the optimization of an objective or cost function.
The optimization process is complex for two reasons:
- Any change in one design parameter typically induces an adverse change in another. In most cases, better performance and more features can only be attained by increasing complexity, cost, and project risk.
- When the decision involves multiple actors across departments, bringing the exercise to a swift and effective conclusion is a challenge in its own right.
A sweet spot (or Pareto optimum) must be located. You know you have hit a Pareto optimum when no criterion can be improved without negatively impacting another.
There could exist not one but multiple Pareto optimums, and part of the challenge is to know which solution will work best.
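The definition of a Pareto optimum can be sketched in a few lines of Python. This is a minimal illustration, not a full optimization method: the `Design` type, the two criteria, and the candidate values are all made up for the example, and both criteria are assumed to be "higher is better".

```python
from typing import NamedTuple


class Design(NamedTuple):
    name: str
    throughput: float    # higher is better (hypothetical unit: requests/s)
    availability: float  # higher is better (fraction of time in service)


def dominates(a: Design, b: Design) -> bool:
    """True if `a` is at least as good as `b` on every criterion and strictly better on one."""
    at_least_as_good = a.throughput >= b.throughput and a.availability >= b.availability
    strictly_better = a.throughput > b.throughput or a.availability > b.availability
    return at_least_as_good and strictly_better


def pareto_front(designs: list[Design]) -> list[Design]:
    """Keep only the designs that no other design dominates."""
    return [d for d in designs if not any(dominates(other, d) for other in designs)]


# Illustrative, made-up candidates.
candidates = [
    Design("A", throughput=1000, availability=0.9990),
    Design("B", throughput=1500, availability=0.9950),
    Design("C", throughput=900, availability=0.9980),  # worse than A on both criteria
    Design("D", throughput=2000, availability=0.9900),
]

front = pareto_front(candidates)
print([d.name for d in front])  # prints ['A', 'B', 'D']
```

Design C drops out because A beats it on both criteria. A, B, and D all remain, which is why a further ranking of criteria is still needed to choose among them.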
Consider the above graph, where system availability and performance are plotted. The idea is as follows: increasing the throughput of a system also increases the risk of damaging it and putting it out of service. The 2D surface in the graph represents the set of Pareto optimums available.
You can decide which criterion is more important, performance or availability. An SLA placing system availability at 99.9%, for example, will determine the peak performance achievable.
Naturally, if you want to increase performance and availability simultaneously, you must upgrade your infrastructure at a further cost.
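To make an availability target such as 99.9% tangible, it helps to convert it into a downtime budget. The short sketch below does that arithmetic; the function name and the choice of a 365-day year are assumptions made for the example.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def allowed_downtime_hours(availability: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the downtime budget implied by an availability target (a fraction in (0, 1])."""
    if not 0.0 < availability <= 1.0:
        raise ValueError("availability must be a fraction in (0, 1]")
    return (1.0 - availability) * period_hours


for target in (0.99, 0.999, 0.9999):
    print(f"{target:.2%} -> {allowed_downtime_hours(target):.2f} hours per year")
```

A 99.9% target leaves roughly 8.76 hours of downtime per year, and each additional "nine" cuts that budget by a factor of ten, which is where the extra infrastructure cost comes from.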
One method of preventing a deadlock is ranking design criteria by order of importance to stakeholders. We discuss this in the next section in more detail.
The optimization process, however, should produce a robust design vis-a-vis the criteria ranking. This robustness is vital as stakeholder influence and preferences change; you want to keep the same solution despite changes in opinions or choices.
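One way to picture this robustness is a weighted-sum ranking that is re-run with perturbed weights. The sketch below is only an illustration under stated assumptions: the design names, the normalized scores, and the weights are hypothetical, and a weighted sum is just one of several possible ranking schemes.

```python
from itertools import product

# Hypothetical normalized scores (0 to 1, higher is better) per criterion.
SCORES = {
    "design_a": {"availability": 0.9, "performance": 0.6, "cost": 0.7},
    "design_b": {"availability": 0.6, "performance": 0.9, "cost": 0.5},
}


def best_design(weights: dict[str, float]) -> str:
    """Return the design with the highest weighted-sum score."""
    def total(name: str) -> float:
        return sum(weights[c] * SCORES[name][c] for c in weights)
    return max(SCORES, key=total)


def is_robust(base_weights: dict[str, float], delta: float = 0.1) -> bool:
    """Check whether the winner survives shifting every weight by -delta, 0, or +delta."""
    winner = best_design(base_weights)
    criteria = list(base_weights)
    for shifts in product((-delta, 0.0, delta), repeat=len(criteria)):
        # Weights are clamped at zero; they are not renormalized because a common
        # scale factor does not change which design has the highest weighted sum.
        perturbed = {c: max(0.0, base_weights[c] + s) for c, s in zip(criteria, shifts)}
        if best_design(perturbed) != winner:
            return False
    return True


stakeholder_weights = {"availability": 0.5, "performance": 0.3, "cost": 0.2}
print(best_design(stakeholder_weights))           # prints design_a
print(is_robust(stakeholder_weights, delta=0.1))  # prints True
print(is_robust(stakeholder_weights, delta=0.3))  # prints False
```

With these made-up numbers, the preferred design survives small shifts in stakeholder priorities but not large ones, which is the kind of sensitivity worth knowing before committing to a solution.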
3.4 Ranking Decision Criteria
In profit-seeking organizations, the top-priority measure is long-term profit. Technical superiority is only desirable if it serves the business through competitive advantage, advancing the value proposition, or lowering long-term costs. Beyond that point, it is typically frowned upon by whoever is in charge of financials.
These subtle and conflicting constraints make technical choices non-trivial and subject to intense negotiations between business, technology, and finance departments.
Profit and utility aside, as long as the final product does the job, there is much room for maneuvering when fine-tuning design parameters. Consider the following example of an online and batch system in an ideal scenario.
| Attribute | Online System | Batch System |
|---|---|---|
| Utility, functionality, business value | Non-negotiable | Non-negotiable |
| Availability | A top priority in mission-critical systems | Medium priority, as batch systems are required to be online only when needed |
| Performance | High priority for the best user experience | Medium priority, as long as the work is done within a reasonable time frame |
| Technology | Best of breed: optimal performance matters more, even at the cost of increased time-to-market | Mature technology that is easy to learn is acceptable if it delivers the required functionality |
Key figures will impose their priorities, especially in a zero-sum game. A large part of a project’s success is guaranteed by securing long-term leadership support.
3.5 How Much Design is Enough
You can explore the design space along two dimensions:
- Breadth, i.e. investigating many alternatives
- Depth, or exploring different decomposition levels
Ideally, you want to explore as many alternative designs as possible and perform as much decomposition and lower-level design as practicable. However, the exploration process might take a lot of time and resources if the system is large and complex, and you typically want to operate within the allocated budget and timeframes.
So, how much exploration is enough? How many simulations or proofs of concept are sufficient before kicking off implementation?
The following excerpt from the NASA Systems Engineering Handbook provides some insight into the stopping criteria.
The depth of the design effort should be sufficient to allow analytical verification of the design to the requirements. The design should be feasible and credible when judged by a knowledgeable independent review team and should have sufficient depth to support cost modelling and operational assessment.
The design may also involve POCs and simulations, themselves costly and time-consuming exercises. System architects and engineers must optimize their design efforts to achieve maximum fidelity at an acceptable cost.
If prototyping is not an option, expert intuition must be relied on at the cost of fidelity.
You also want to ensure that the selected candidate is technically feasible.
- Breadth can be achieved by involving as many stakeholders as practicable in the design review exercise.
- Depth will depend on how much you trust your engineers and developers to make low-level technical decisions independently.
- How much simulation or proofs of concept are enough depends on the maturity and technological readiness of the solution concept adopted.
3.6 Solution vs Software Architecture
Several articles on this website have been dedicated to architecture, both software and solution. In these short paragraphs, we will focus only on their relevance in the design of system integration projects.
One of the more straightforward definitions of architecture involves structures of connected nodes in a cluster.
The nodes are platforms, applications, or systems in the case of a solution, typically connected by interfaces. In contrast, in software architecture, the nodes can be classes or domains, and the connections are either structural or behavioural.
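The node-and-interface view lends itself to a simple graph representation. The sketch below is a hypothetical example (the system names and the `INTERFACES` map are invented) showing how such a structure can answer a typical integration question: which systems sit downstream of a given component and could be affected by a change to it.

```python
from collections import deque

# Hypothetical solution architecture: each system sends data to the systems listed.
INTERFACES = {
    "web_portal": ["order_service"],
    "order_service": ["billing", "warehouse"],
    "billing": ["ledger"],
    "warehouse": [],
    "ledger": [],
}


def downstream(system: str, interfaces: dict[str, list[str]]) -> set[str]:
    """Return every system reachable from `system` by following interfaces (breadth-first)."""
    seen: set[str] = set()
    queue = deque([system])
    while queue:
        current = queue.popleft()
        for neighbour in interfaces.get(current, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    seen.discard(system)  # a system is not downstream of itself, even in a cycle
    return seen


print(sorted(downstream("order_service", INTERFACES)))  # prints ['billing', 'ledger', 'warehouse']
```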
Where does the design come into the picture?
Architecture is a prerequisite for any further analysis or optimization. You usually choose how you want something to be done (architecting a solution), then design and implement its components.
4. When Does Solution Design Take Place
We have previously mentioned that solution designs come in two forms: low-level and high-level design, each serving a different role and being prepared in a different phase of the software development life cycle or SDLC.
During the Analysis stage, a solution architect creates the project’s High-Level Solution Design (HLD). The HLD is an architectural plan for the IT system and is usually submitted alongside the Statement of Work (SOW).
The HLD is destined for a slightly technical but highly business audience, and its purpose is:
- To furnish stakeholders with a bird’s eye view of the system’s target state once implementation is completed. The stakeholders can then provide any comments and feedback before signing off.
- To create multiple conceptual solutions and select the fittest. The HLD also determines the high-level components and functions of the system, as well as a mapping between them, and establishes a set of critical design and operational parameters.
Once the SOW and HLD are signed off, and before any development starts, the Design phase kicks off, during which software engineers prepare the Low-Level Solution Design or LLD.
The aim of the Low-Level Solution Design is as follows:
- Perform a further logical or functional decomposition of the main components and functions of the solution and select the subcomponents to be used. The design and specifications for these subcomponents are also determined at this stage.
- Provide a reference for developers that facilitates creating and implementing user stories and writing unit or integration test cases, thus supporting Test-Driven Development, a practice we highly recommend.
- Provide a basis for stakeholder acceptance which can be completed through testing, requirement verification, and quality assurance.
5. Why Is Solution Design Important
Let’s start this discussion with a compelling anecdote illustrating how a massive gap can arise between a product implemented at a considerable cost and what the customers need.
5.1 Failure to Address a Business Need
In February 2008, Qantas – Australia’s national airline – cancelled Jetsmart, a $40 million parts management system implementation. The engineers simply refused to use it. The union’s Federal Secretary explained that the software was poorly designed and difficult to use and that engineers didn’t receive sufficient training.
The three reasons given by the union’s Federal Secretary for abandoning the project, along with others, are the subject of this section’s discussion.
We will argue that a proper design process should help eliminate such risks.
5.2 Avoiding Costly Redesign
Most would agree that the sooner an error is detected, the less costly it is to fix. Careful planning and design at the early stages of the project therefore help control uncertainty and project risk.
Rework is not an issue as long as it’s cheap, but such cases are generally rare. The price of rework is measured by the cost of change and comes from the following:
- The nature of the project: In some cases, little can be done to avoid high expenditures when rework is required. The cost of change rises with scale as more people and effort are involved in redesigning a feature, and the longer the problem goes undetected, the harder it will be to fix it, especially if more features are built around it.
- The project delivery methodology: Agile projects are better suited to early error detection because of their shorter feedback loops. Agile is not always applicable, however, although a hybrid model can often be adopted at little additional risk.
When discussing rework or redesign, a distinction must be made between corrective measures and experimentation.
In the latter case, you typically carry more than one design and try these out (simulations, proof of concept) to test their feasibility and suitability. In the former, rework is created due to failure in the analysis or implementation stages, and this failure could have been avoided with more diligence.
Diligence in designing solutions can reduce or even eliminate rework.
5.3 Collaboration and Stakeholder Management
IT projects today involve a broad gamut of expertise from cross-functional departments such as business, security, compliance, infrastructure, procurement and technology.
This complexity requires high levels of collaboration on, possibly, multiple projects running simultaneously.
Collaboration challenges become more acute in the following situations:
- The group is dynamic:
  - With members popping in and out from different projects or outside the organisation, teams need more time to know each other, get familiar with each other’s modus operandi, and trust each other. No trust, no collaboration.
  - Maintaining a coherent and functioning structure requires mature processes and a solid organisational culture. Information should be readily available and easily accessible to newcomers.
- The project is sizeable:
  - With many platforms incorporated into the solution and where no single person is an expert in the whole stack, people need to understand the big picture without hunting for information.
  - Dependencies on other project streams have to be articulated and deliveries synchronised. Otherwise, project managers will have a hard time understanding what to manage.
Collaboration can be enhanced by channelling high-quality information on design to all stakeholders.
Tools like Jira, Confluence, and Microsoft SharePoint facilitate access, navigation, search, sharing, and collaboration on business or technical knowledge, which greatly improves knowledge management across the organisation.
To mitigate any risk from misaligned requirements, schedules, timeframes, and priorities, solution design exercises bring relevant (and competing) stakeholders together (business/technology, client/vendor).
5.4 Reduces Project Risk
Following is a list of project risks that you can minimize through careful planning and design:
- Deploying new features into production is always an event. When designing a solution, ensure you take the necessary steps to minimize the risk of lengthy downtimes, service interruption, or deterioration.
- Consider backward compatibility and impact on existing features, interfaces, or services. Include the necessary tests and dry runs to ensure seamless promotion to production.
- Production deployment approaches (big bang vs phased), pilot testing, and other tools can be used to lower the risk of deploying new features or migrating new data.
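A phased rollout is often implemented by exposing a new feature to a fixed percentage of users and raising that percentage over time. The sketch below shows one common way to do this deterministically with a hash of the user identifier; the function name and the bucketing scheme are assumptions for illustration, not a reference to any specific feature-flag product.

```python
import hashlib


def in_rollout(user_id: str, percent: int) -> bool:
    """Return True if `user_id` falls within the first `percent` of 100 stable buckets."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket in [0, 99]
    return bucket < percent


# The same user always lands in the same bucket, so raising the percentage
# only ever adds users to the rollout; nobody is switched back and forth.
pilot_users = [u for u in (f"user-{i}" for i in range(1000)) if in_rollout(u, 10)]
print(len(pilot_users), "of 1000 users are in the 10% pilot")
```

Because the bucket depends only on the user identifier, a user included at 10% is still included at 50%, which keeps the pilot population stable as the rollout widens.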
5.5 Getting Estimations Right
You can only effectively manage a project if you know how much it will cost and how long it will take to build. Consequently, effort estimation must be as accurate as possible, which is only realisable if you know exactly what steps, resources, and time are involved.
Effort estimation occurs at two levels.
- T-shirt size or Software Order of Magnitude is a high-level estimate allowing project leaders to calculate approximate budgets and schedules. This exercise follows from the high-level architecture and design.
- The estimates are then refined by considering the low-level details. Tasks can grow and shrink once the exact business requirements are defined.
Keeping the two levels of estimations in sync is not obvious, but that’s a topic for a different day.
With a detailed and comprehensive design, you can mitigate optimism bias and avoid overlooking overhead tasks such as the following:
- Environment readiness
- Change management tasks and approvals
- Lead times to procurement of licenses and hardware
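The effect of overlooked overhead on the two estimation levels can be illustrated with a small roll-up. Everything in this sketch is hypothetical: the T-shirt bands, the task names, and the person-day figures are invented to show the mechanics only.

```python
# Hypothetical T-shirt bands in person-days: lower bound inclusive, upper bound exclusive.
TSHIRT_BANDS = {"S": (0, 20), "M": (20, 60), "L": (60, 150), "XL": (150, float("inf"))}


def tshirt_size(person_days: float) -> str:
    """Map a person-day estimate onto a T-shirt size."""
    for size, (low, high) in TSHIRT_BANDS.items():
        if low <= person_days < high:
            return size
    raise ValueError("estimate must be non-negative")


def bottom_up_estimate(feature_tasks: dict[str, float], overhead_tasks: dict[str, float]) -> float:
    """Sum the low-level feature tasks and the overhead tasks."""
    return sum(feature_tasks.values()) + sum(overhead_tasks.values())


# Made-up figures, in person-days.
feature_tasks = {"api_integration": 25, "ui_changes": 15, "data_migration": 12}
overhead_tasks = {"environment_readiness": 5, "change_approvals": 3, "license_procurement": 4}

features_only = sum(feature_tasks.values())
total = bottom_up_estimate(feature_tasks, overhead_tasks)

print(features_only, tshirt_size(features_only))  # prints 52 M
print(total, tshirt_size(total))                  # prints 64 L
```

In this example the overhead tasks alone move the project from one size band to the next, which is the kind of drift between high-level and refined estimates that a detailed design helps surface early.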
5.6 Designing Products for the Future
It is hard to predict the technical requirements of future IT solutions and how customer preferences will change.
What you can do, however, is:
- Defer consequential decisions for as long as practicable.
- Avoid unnecessary self-imposed constraints that might hinder future changes.
6. Solution Design Requirements
When designing a system, we look for an effective solution (does the job) that is also efficient (does it well) and cost-effective.
The diagram below articulates these ideas with a hierarchy where Operational Excellence (in the traditional sense, slightly different from how we use the term on this website for Software Development) is on the top.
A practical solution cannot achieve everything, and once the optimal design has been found, any further change means compromising one aspect for another.
As explained earlier, the solution design process is an optimization exercise that selects values for key design and operating parameters to maximize an objective function such as utility/cost. The sometimes conflicting requirements from the above diagram make this process more art (through heuristics) than science.
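As a toy illustration of maximizing utility/cost over design and operating parameters, the sketch below runs an exhaustive search over two parameters mentioned earlier: the number of TCP links and the number of metrics shown on a monitoring dashboard. The `utility` and `cost` formulas are invented assumptions with no empirical basis; real projects rarely have such clean functions, which is precisely why heuristics dominate.

```python
import math
from itertools import product


def utility(links: int, metrics: int) -> float:
    """Hypothetical utility: throughput and observability gains with diminishing returns."""
    return 10 * math.log1p(links) + 2 * math.log1p(metrics)


def cost(links: int, metrics: int) -> float:
    """Hypothetical cost: a fixed baseline plus per-link and per-metric overhead."""
    return 5 + 3 * links + 0.5 * metrics


def objective(params: tuple[int, int]) -> float:
    """The quantity to maximize: utility per unit of cost."""
    links, metrics = params
    return utility(links, metrics) / cost(links, metrics)


def best_parameters(link_options, metric_options) -> tuple[int, int]:
    """Evaluate every combination and return the one with the highest utility/cost ratio."""
    return max(product(link_options, metric_options), key=objective)


best = best_parameters(range(1, 9), range(0, 51, 10))
print("links, metrics:", best, "utility/cost:", round(objective(best), 3))
```

Exhaustive search only works because this parameter space is tiny. With many interacting parameters the space grows combinatorially, and the choice of objective function itself becomes a matter of negotiation.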
6.1 Functional Requirements
Functional requirements (also called business requirements) cover the following:
- Services and features that have business value to the user. They typically solve a business need.
- User experience, including user-friendliness, aesthetics, and running costs.
- The system’s behaviour in nominal and exceptional scenarios.
The system users are generally happy to pay for the services they receive. For example, a video game’s functional requirements may cover the gaming experience, the operating systems it can run on, and the minimum hardware required.
6.2 Non-Functional Requirements
Non-functional requirements make the product viable and are equally interesting to the organisation and the end user.
The end user does not directly pay for non-functional requirements; the supplier does through initial investments and running costs.
Returning to our example of video game software, the typical gamer is not necessarily thrilled by the software’s technology stack, the CI/CD pipeline on which it gets built, or its maintainability and modularity as long as they have a great gaming experience.
Maintainability and modularity, however, are interesting for the vendor as they can keep the costs down and increase the value proposition.
6.2.1 Compliance
Most industries, especially in the financial or health sectors, are regulated by public or private regulatory bodies.
For example, the payment industry has many regulations to follow. Most notable are EMV and PA-DSS, standards driven by global card schemes such as Visa and Mastercard.
It is common for whole projects to be implemented just to comply with the latest mandates.
Solution designs should observe rules and regulations for the industry, project, or software implemented.
6.2.2 Security
Software and cyber security requirements are essential to any IT project. See the ISO/IEC 25010 standard on software quality for a complete list.
Security can be broken down into five major categories:
- Confidentiality: Protecting private data from unauthorized access.
- Integrity: The ability to detect whether messages exchanged between two parties were modified or tampered with along the way.
- Non-repudiation: The degree to which actions or events can be proven to have occurred so that they cannot be repudiated later.
- Accountability: The firm’s ability to audit sensitive activities on the system.
- Authenticity: The degree to which internal or external entities accessing the system can be authenticated.
Security requirements must be observed in all designs involving sensitive information such as user and customer data.
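As one concrete illustration of the integrity category, a message authentication code lets a receiver detect tampering in transit. The sketch below uses HMAC-SHA256 from the Python standard library; the key, the message format, and the helper names are assumptions for the example, and key distribution is out of scope.

```python
import hashlib
import hmac


def sign(message: bytes, key: bytes) -> str:
    """Compute an HMAC-SHA256 tag for `message` using the shared secret `key`."""
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def is_untampered(message: bytes, tag: str, key: bytes) -> bool:
    """Recompute the tag and compare it in constant time with the one received."""
    return hmac.compare_digest(sign(message, key), tag)


shared_key = b"example-shared-secret"  # hypothetical key, for illustration only
original = b"transfer:10:account-123"
tag = sign(original, shared_key)

print(is_untampered(original, tag, shared_key))                      # prints True
print(is_untampered(b"transfer:1000:account-123", tag, shared_key))  # prints False
```

Note that a shared-key MAC addresses integrity and authenticity between the two parties but not non-repudiation, since either holder of the key could have produced the tag. Non-repudiation typically calls for asymmetric digital signatures.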
6.2.3 Scalability
A system is scalable if it can be easily expanded (either vertically by adding more resources or horizontally by adding more nodes/instances) to accommodate additional load or higher performance requirements.
An example of a highly scalable system is a modern database engine like Oracle; it can be scaled horizontally (by adding more instances) and vertically (by adding more resources such as discs) without bringing down the service.
6.2.4 Extensibility and Modularity
Extensibility and modularity are the hallmarks of great architecture and design.
- A system is extensible if it can be easily extended with new functionality without touching its core components or adversely impacting existing features. A minimal coupling between system components and features facilitates extensibility.
- A modular product or solution comprises smaller, independent, self-contained components that can work together. Modular pieces can be added, removed, enabled, or disabled on demand.
Modularity is especially vital in mega-projects as it allows learning from mistakes without jeopardizing the entire project. With modular designs, systems can scale quickly and go into production as soon as the first unit is operational.
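The two properties can be illustrated with a small registry pattern, sketched below under stated assumptions: `ExporterRegistry` and the two exporters are hypothetical names invented for the example. The core class never changes when a new format is added (extensibility), and each exporter is a self-contained piece that can be enabled or disabled on demand (modularity).

```python
import json
from typing import Callable

Exporter = Callable[[dict], str]


class ExporterRegistry:
    """Core component: it knows nothing about any concrete exporter."""

    def __init__(self) -> None:
        self._exporters: dict[str, Exporter] = {}
        self._disabled: set[str] = set()

    def register(self, name: str, exporter: Exporter) -> None:
        """Add new functionality without modifying the core."""
        self._exporters[name] = exporter

    def disable(self, name: str) -> None:
        self._disabled.add(name)

    def enable(self, name: str) -> None:
        self._disabled.discard(name)

    def export(self, name: str, record: dict) -> str:
        if name not in self._exporters or name in self._disabled:
            raise LookupError(f"exporter {name!r} is not available")
        return self._exporters[name](record)


def to_json(record: dict) -> str:
    return json.dumps(record, sort_keys=True)


def to_csv(record: dict) -> str:
    return ",".join(str(record[key]) for key in sorted(record))


registry = ExporterRegistry()
registry.register("json", to_json)
registry.register("csv", to_csv)

print(registry.export("csv", {"id": 7, "city": "Oslo"}))  # prints Oslo,7
```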
6.2.5 Fault Tolerance
Fault tolerance refers to the system’s ability to continue functioning even though one or more parts have failed.
Fault tolerance is critical in large system integration projects where many components collectively support a service; in such cases, the failure of one piece might knock the service out, and fallback mechanisms need to be implemented to prevent that.
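A fallback mechanism of this kind can be as simple as retrying the primary dependency a few times and then switching to a degraded alternative. The sketch below is a minimal illustration: the function names, the retry count, and the choice of `ConnectionError` as the failure signal are all assumptions, and production systems usually add timeouts, backoff, and circuit breaking.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_fallback(primary: Callable[[], T], fallback: Callable[[], T], retries: int = 2) -> T:
    """Try `primary` up to `retries` times; if every attempt fails, use `fallback`."""
    for _ in range(retries):
        try:
            return primary()
        except ConnectionError:
            continue  # transient failure: try again
    return fallback()


attempts = {"primary": 0}


def live_inventory() -> dict:
    """Hypothetical call to a downstream service that is currently unreachable."""
    attempts["primary"] += 1
    raise ConnectionError("inventory service unreachable")


def cached_inventory() -> dict:
    """Hypothetical degraded answer served from a local cache."""
    return {"source": "cache", "stale": True}


result = call_with_fallback(live_inventory, cached_inventory)
print(result, "after", attempts["primary"], "failed attempts")
```

The service keeps answering, with data explicitly flagged as stale, instead of failing outright because one component is down.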
6.2.6 Compatibility
Two systems are compatible if they can appropriately integrate and exchange data.
A different example of compatibility can be with OS platforms or hardware. An application is compatible with certain Operating Systems or platforms if it can be successfully deployed.
A good solution design does not limit itself to working only with specific software, hardware, or message types. Use industry-standard messaging and data transmission protocols to integrate with similar third-party systems.
7. Project Constraints
Project constraints serve as boundaries around requirements, but the two are not the same. Given the requirements, an infinite number of designs can be put forward, only a subset of which will be commercially viable.
You typically want to build a product or feature as quickly as possible to benefit from any windows of opportunity in the market. A specific budget also limits you.
Budgetary and schedule constraints are the ones most commonly encountered in projects.
7.1 Budget
Every project requires budgeting, and once the budget is allocated, it will not be easy to modify. For this reason, it is vital to have a precise idea of:
- How much effort is involved
- What types of skills and what headcount are required
- If additional infrastructure or licenses are needed
- Any other project expenses
A solution design is a great place to tackle these questions. A cost-efficient design will also lower maintenance and operational costs.
7.2 Schedule and Timeline
Missing deadlines can be costly for a team: it can lose to the competition, incur fines from regulatory agencies, or send the wrong message to senior management.
The effort estimates generated from the solution design, together with the number of resources available for the project, are usually what you need to get the dates right.
In Systems Engineering, you typically create an objective function representing the criteria your stakeholders care about.
The objective function is parameterized by the system design and operation parameters, and you would usually use Multidisciplinary Design Optimization (MDO) techniques to try and optimize that function.
Software engineering does not have the equivalent of an objective function. Instead, you have a set of requirements, some mandatory and others nice-to-have. This difference between the two disciplines gives rise to two phenomena.
- The first phenomenon is that software and solution design are more of an art than hard science. Still, like art, they are governed by heuristics laboriously gathered over the years.
- The second phenomenon is that every new project that adds features to the solution will have to leverage the existing design; it would not be cost-efficient to redesign a solution from the ground up with every new addition. This approach to solution building produces multi-layered architectural hierarchies that are the hallmark of complexity, and the latter always comes at a cost.
For these two reasons, solution and software design offer additional challenges that architects and engineers must tackle, and it’s never a good idea to sacrifice design efforts in an unsustainable manner.
9. Further Reading
- Great talk by Martin Fowler on Agile Architecture.
- Fantastic lecture by Robert Martin on Clean Architecture.
- Design Definition and Multidisciplinary Optimization — MIT OpenCourseWare