Technical Risk Management and Decision Analysis — Introduction and Fundamental Principles

1. Overview

I could not find a better way to start an article on Risk Management than by quoting the opening lines of Donald Lessard and Roger Miller’s 2001 paper, which briefly but lucidly summarise the nature of large engineering endeavours. They go like this:

Large engineering projects are high-stakes games characterized by substantial irreversible commitments, skewed reward structures in case of success, and high probabilities of failure.

— Understanding and Managing Risk in Large Engineering Projects

This article draws heavily on three handbooks that NASA has published (see the References section for links) over the past decade or two.

These handbooks thoroughly discuss Risk Management for large engineering projects, and I believe they would immensely benefit people in the software business.

While a rudimentary risk management system is sufficient in most cases, especially for small projects, it is always an asset to possess adequate knowledge and enough depth and rigour when treating any topic that can adversely impact the organisation’s capabilities of delivering successful projects.

This approach, focusing on a rigorous, in-depth, and scientific education in the software business, is a primary tenet of Operational Excellence.

2. Project Risk

2.1 Definition of Risk

Intuitively, when considering risk, we tend to focus on the probability of adverse events while overlooking their potential impact. Risk, however, is the possibility of something bad happening, compounded by its consequences.

The NASA Risk Management Handbook provides the following definition of risk in Systems Engineering.

Risk is the potential for performance shortfalls, which may be realized in the future, with respect to achieving explicitly established and stated performance requirements.

— NASA Risk Management Handbook

As we shall see in the next section, risk can cover four domains of performance-related objectives.

In this sense, performance measures how well we have achieved our goals (quality, technical, budgetary, schedule, requirement satisfaction). It is not strictly tied to the system’s performance (processing speed, throughput, capacity, load-bearing).

The Definition of Risk in NASA’s Systems Engineering Handbook

As applied in Risk Management, risk items are characterised as a set of triplets made up of the following elements:

One or more scenarios leading to degraded performance with respect to a specific objective.
- Examples of degraded performance include budget overruns, delayed schedules, poor system performance, financial loss, loss to competition, reputational loss, and non-compliance with regulatory mandates.
- Generally, a scenario is a sequence of credible events that specifies a system’s evolution from one state to another. In risk management, scenarios identify how a system can evolve from its current state to an undesirable one.
The likelihood of these scenarios occurring, which can be measured qualitatively or quantitatively.
- If historical data or statistical models exist (for example, models relating reliability, security, or capacity to independent variables such as hardware size, specific configuration, or setup), they can be used to derive a numerical measure of the likelihood, provided that conditions have not changed between the past and the present.
- If no such data is available, Detailed Logic Modeling Estimation can be used. This method breaks the system down into logical parts in a top-down approach. Then, a bottom-up approach is applied to quantify the risk for every module across different levels for each scenario or event. The evaluations collected at the component level are then integrated using mathematical or Boolean logic to assess overall risk.
The consequences or impact should any of the scenarios occur. The severity can be qualitative or quantitative. For example:
- Some consequences are measured on a continuous scale: an online processing system can experience a degradation in performance ranging from 0% to 100%, and a budget can be overrun by a certain amount.
- On the other hand, the reputational damage incurred can be measured on a discrete scale of, for example, one to five.
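The bottom-up integration step of logic modelling mentioned in the list above can be sketched as a small fault-tree calculation. This is only a minimal illustration: it assumes component failures are independent, and the component names and probabilities are hypothetical rather than taken from any NASA handbook.

```python
from math import prod


def and_gate(probabilities):
    """Probability that ALL independent events occur."""
    return prod(probabilities)


def or_gate(probabilities):
    """Probability that AT LEAST ONE independent event occurs."""
    return 1.0 - prod(1.0 - p for p in probabilities)


# Hypothetical component-level estimates (illustrative values only).
p_primary_db_fails = 0.02
p_replica_db_fails = 0.05
p_load_balancer_fails = 0.01

# The data tier is lost only if both the primary and the replica fail.
p_data_tier_down = and_gate([p_primary_db_fails, p_replica_db_fails])

# The service is down if the data tier OR the load balancer fails.
p_service_down = or_gate([p_data_tier_down, p_load_balancer_fails])

print(f"P(data tier down) = {p_data_tier_down:.5f}")
print(f"P(service down)   = {p_service_down:.5f}")
```

If failures are correlated (for example, both databases share a power supply), the independence assumption understates the likelihood and the gates would need conditional probabilities instead.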

If any uncertainties exist, they are typically included when evaluating the likelihood of a scenario and its consequences by providing lower and upper bounds.
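One possible way to record such a triplet, including lower and upper bounds for the uncertain quantities, is sketched below. The class names, fields, and figures are assumptions made for illustration; the handbooks do not prescribe a data structure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Lower and upper bounds expressing uncertainty in an estimate."""
    low: float
    high: float

    def __post_init__(self):
        if self.low > self.high:
            raise ValueError("low must not exceed high")


@dataclass(frozen=True)
class RiskItem:
    """One (scenario, likelihood, consequence) triplet."""
    scenario: str
    objective: str
    likelihood: Interval   # probability bounds in [0, 1]
    consequence: Interval  # non-negative impact bounds, here in cost units

    def exposure(self) -> Interval:
        """Bounds on likelihood x consequence (valid for non-negative values)."""
        return Interval(
            self.likelihood.low * self.consequence.low,
            self.likelihood.high * self.consequence.high,
        )


# Hypothetical risk item (illustrative values only).
risk = RiskItem(
    scenario="Key vendor delivers the payment gateway late",
    objective="schedule and budget",
    likelihood=Interval(0.10, 0.30),
    consequence=Interval(20_000, 50_000),
)
print(risk.exposure())
```

Keeping the bounds explicit, rather than collapsing them into a single number, makes it harder to forget how uncertain the underlying estimates are.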

2.2 Types and Sources of Risk

Four types of risk can be associated with implementation projects.

These are:

Budgetary risk is associated with the inability to meet fiscal requirements. Two primary scenarios can lead to this. In the first, the estimates for solution design, development, verification, and validation were inaccurate. In the second, budgets are overrun due to the inability to meet technical constraints (novel or immature technology) or schedule constraints (unforeseen staff absences, poor processes, lack of skills, or removal of key resources).
Schedule risk can be related to inadequate planning or resource allocation, unreasonable timelines, poor time-management practices, inaccurate effort estimation processes, or a lack of project management skills.
Technical risk has two primary sources. The first is poor production processes across the different stages of the SDLC: analysis, design, development, testing, and quality assurance. The second is technological challenges, such as immature technology stacks, poor technological decisions, and a lack of specialised skills.
Programmatic risk is associated with areas outside the project manager’s control, such as regulation or compliance updates, shifting priorities in the organisation, unforeseen technological, schedule or cost-associated risks, supplier delays, poor quality third-party products, or any other external dependency.

In general, a Risk Management process will attempt to address some of the primary issues that have historically derailed IT programs. In addition to what has already been mentioned above, we summarise these issues as follows:

The mismatch between stakeholder expectations and the actual effort and resources required to implement the project and address the risks involved
The underestimation of risks involved and the overpromising by the decision-makers to the stakeholders
The miscommunication of risks associated with different decisions or alternatives

3. Risk Management Strategy

3.1 What Is Risk Management

Lessard and Miller’s paper states that “successful projects are shaped with risk resolution in mind” and that “successful sponsors[…] embark on shaping efforts to influence risk drivers.”

Leaning on the statement above, we can think of risk management as:

Recognising the interaction between external risk drivers and internal managerial choices
Identifying leverage points in a project and applying effort to these leverage points to influence risk drivers.

The objective of risk management is not the total elimination of risk but the identification of viable projects where future success and revenue generation are possible.

3.2 General Considerations

The following are the primary considerations when designing a sound risk management strategy for implementing complex software solutions.

Solution designs, and the concepts they are based upon, are developed to a sufficient level of detail to enable a correct assessment of the risk involved.
Subsystems of the solution can be profiled, and those deemed riskier can benefit from the additional design, verification, and validation efforts (simulations, analysis, proof of concepts).
A threat (and opportunity) list is maintained throughout the project, and the cost associated with the aggregate risk is calculated. Risk entries are monitored, and their impact, likelihood, and cost are updated periodically. A mitigation plan can be implemented to deal with the risks. Risks are prioritised, and the highest priorities can claim easier access to the project reserves.
Identifying the risk tolerance levels for every performance measure is crucial for all stakeholders. Recall that a performance measure is a value that tells us how close or far we are from the best possible position we have for a specific objective. The risk tolerance will allow us to set a lower boundary for each goal. It also allows a fair comparison between alternative solutions. An ideal solution will be insensitive to changes in risk tolerance levels.
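The threat list described above can be sketched as a simple risk register. The example below aggregates risk as expected cost (likelihood multiplied by cost impact), which is one common convention and an assumption here rather than something the handbooks mandate. All entries and figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RegisterEntry:
    name: str
    likelihood: float   # probability in [0, 1]
    cost_impact: float  # cost incurred if the risk materialises

    @property
    def expected_cost(self) -> float:
        return self.likelihood * self.cost_impact


# Hypothetical register entries (illustrative values only).
register = [
    RegisterEntry("Immature messaging framework", 0.30, 40_000),
    RegisterEntry("Loss of lead architect", 0.10, 90_000),
    RegisterEntry("Late regulatory change", 0.05, 200_000),
]

# Rank risks so the highest expected cost comes first.
prioritised = sorted(register, key=lambda e: e.expected_cost, reverse=True)
aggregate = sum(e.expected_cost for e in register)

for entry in prioritised:
    print(f"{entry.name}: {entry.expected_cost:,.0f}")
print(f"Aggregate expected cost: {aggregate:,.0f}")
```

Ranking by expected cost alone can hide rare but severe risks, so in practice the likelihood and impact columns are usually reviewed alongside the product, and the entries are re-estimated periodically as the project evolves.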

3.3 A Successful Risk Strategy

According to the NASA Risk Management Handbook, a successful risk management strategy must be based on the following tenets:

Qualitative assessments must give way to quantitative analyses where applicable and practical. This shift should apply to all levels of risk management, from individual risks to aggregate ones, and from handling the impact of a single adverse event to managing its primary sources and drivers.
The risk management processes should be systematic, well-developed, and integrated into a formal risk analysis framework. The outcomes of these processes should be communicated effectively to decision-makers, enabling them to make informed decisions.
The risk management strategy and the decision-making process must involve the right stakeholders and consider their risk tolerance levels when considering multiple alternatives.
The NASA handbook acknowledges the ease with which decisions can be made by individuals working solo due to highly developed decision-making methods. On the other hand, challenges quickly arise inside complex organisational hierarchies. Therefore, any risk management strategy should consider the complexity of an organisation’s structure, organisational culture, and key stakeholder policies and plans.

3.4 When Is Risk Management Applied

NASA uses the Systems Engineering Engine to execute its mission projects, an “engine” well-suited for complex engineering endeavours. The diagram below shows a simplified version of that engine (ref. NASA Risk Management Handbook Figure 1).

Technical risk management falls under Control; the stakeholder requirements, interfaces, data, and configuration are locked and monitored for change. If an update needs to be applied, it undergoes a change management process, and the risk is reassessed.

However, in software initiatives, we typically combine Software Development Lifecycle (SDLC) concepts with Agile or Waterfall-based delivery methodologies, neither of which discusses risk or prescribes a risk management strategy.

NASA’s Systems Engineering Engine

That is not to say that no risk management occurs in software projects. Risk management is often conducted at the project or program management level, with limited interaction with technical stakeholders and a lack of a formal and rigorous framework.

Only when something goes wrong that directly impacts the project’s timeline or budget is a flag raised, usually for the first time. In my 15+ years of experience in the industry, I have never participated in or been aware of a risk identification session being conducted.

So, how is risk management applied in software projects? Can any specific and fundamental principles or best practices be successfully implemented?

If you think closely about Agile, you will find it has a unique attitude towards change. In effect, Agile’s second principle goes like this:

Welcome changing requirements, even late in development. Agile processes harness change for the customer’s competitive advantage.

— Agile Manifesto

Agile practitioners are required to welcome change, even when it occurs late in the process; however, there’s no mention of risk, risk management, or mitigation plans.

The issue is that Agile is not always applicable, and even if you are applying Agile, it is not wise to assume all risks. This statement is even more accurate when you have large programs with multiple streams and coordinated activities.

The interdependencies between the different streams can be very strong for such projects, making them highly sensitive to potential threats. In such cases, an adverse event in one stream may soon cascade upstream, downstream, or sideways with devastating effects.

Change must be controlled and monitored in megaprojects (multiple streams, complex solutions, global presence, and multi-cultural teams), and risk management must be applied to prevent failures.

A risk management framework needs to be implemented within the SDLC and project management space to be effective.

Although the implementation will undoubtedly be context-dependent, we believe the ideas discussed in this article remain valuable.

3.5 The Iron Triangle

The Iron Triangle is a great way to visualize the interactions between three influential risk drivers in any program: cost, schedule, and technical.

Iron Triangle of Cost, Schedule, and Technical Risk

Two main ideas govern these interactions:

Programmatic constraints, such as limited budgets, minimal time-to-market, and high-quality standards, restrict the range in which cost, schedule, and technical risk can be managed.
Any attempt to reduce the risk in one of the three areas will likely increase the risk in at least one of the other two.

For example:

Budget cuts often result in sacrifices to testing and quality assurance, which can lead to safety and quality issues.
Tight timeframes may require additional resources, leading to increased costs. This may also mean reducing or eliminating enabling and supporting activities, such as solution design, testing, and quality assurance.

Program managers, system engineers, and risk analysts work together to find an optimal distribution of risk among the three categories that leads to a viable solution.

3.6 Avoiding Unnecessary Risks

To illustrate these concepts further, let’s try and marry them with a few of the Operational Excellence principles we advocate for software development.

Principle 2 discusses process management and governance, without which the project will inevitably suffer losses in all three areas: cost, schedule, and quality.

Lean, efficient, and effective production processes result in larger profit margins, more room for continuous improvement, a competitive edge, a stronger value proposition, and higher-quality products.

Superior processes and exceptional talent enable you to achieve the best quality at the lowest budgets and in the shortest timeframes.

Principle 3 stresses the value of quality deliveries, while principle 5 discusses the importance of focusing on business value. The central theme here is that a limit exists, although difficult to determine, beyond which the project is not viable.

Accepting non-viable projects is a risk to your business. It diverts valuable resources to non-profitable projects, distracting your team and preventing them from focusing on strategic initiatives, while accumulating large amounts of technical debt by accommodating unreasonable timeframes.

The danger arising from these scenarios is due to their highly technical nature and long-term impact that is not discernible or evident in the present.

Finally, principle 6 provides a method for curtailing novel or immature technology risk. Fortunately, the vibrant and diverse technology stacks and existing frameworks in today’s software world make this endeavour straightforward.

4. Risk-Informed Decision Making

4.1 Definition

The NASA Systems Engineering Handbook distinguishes between Risk Management and Risk-Informed Decision-Making (RIDM). NASA’s risk management approach combines Continuous Risk Management (CRM) and Risk-Informed Decision-Making (RIDM).

CRM monitors and tracks risk throughout the project’s implementation, while RIDM is used to increase awareness of risk when critical decisions are made during execution.

RIDM helps decision-makers choose between competing alternatives by comparing the risks involved. The objective is to help avoid redesign and rework activities that jeopardise costs and schedules.

RIDM is invoked for key decisions such as architecture and design decisions, make-buy decisions, source selection in major procurements, and budget reallocation (allocation of reserves), which typically involve requirements-setting or rebasing of requirements.

— NASA Risk-Informed Decision-Making Handbook

During a product’s implementation lifecycle, critical decisions are typically made that require a cost-benefit analysis and risk balancing. RIDM naturally comes into play in such situations. These decisions are characterised by the following:

High Stakes — High stakes may include significant costs, an irreversible commitment of resources, safety, and business survival.
Complexity — The future implications of the decisions are not evident and require detailed analysis.
Uncertainty — Uncertainty in inputs may lead to uncertain outcomes, requiring risk analysis for various possibilities.
Stakeholder Management — Stakeholder management is essential when numerous stakeholders have different preferences, risk tolerance levels, policies, and influence.

4.2 The RIDM Process

The NASA Risk-Informed Decision-Making Handbook is 256 pages long, and the following few paragraphs will not do it full justice.

We recommend reading the handbook for a comprehensive understanding, as what follows is only a high-level illustration of the process.

The Risk-Informed Decision-Making process consists of three stages:

Stage 1: Identification of Alternatives — During this stage, several alternative solutions that satisfy the performance measures associated with the performance objectives are compiled. Limits on the acceptable range of a performance measure are called constraints, a cap on project cost being one example. While performance measures are being created, stakeholder analysis is carried out simultaneously and influences the allowable values of the performance measures.
Stage 2: Risk Analysis of Alternatives — Each alternative is weighed against the anticipated performance objectives it is expected to produce, considering any uncertainties that may influence the input. The outcome of this process is a probability distribution of each performance measure for each alternative. The decision must be robust to model perturbations and reasonably foreseeable information.
Stage 3: Risk-Informed Alternative Selection — A normalised method must be established to compare multiple alternatives. This normalisation is required since each option’s performance measure probability distribution can differ. A fair comparison method that gives equal chances for each alternative must be used for the process to be consistent.
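The risk analysis and selection stages described above can be illustrated with a small Monte Carlo simulation. This is a rough sketch, not NASA’s procedure: it assumes cost is the only performance measure, models it with a triangular distribution, and compares alternatives by the probability of breaching a cost constraint. The alternatives, estimates, and limit are hypothetical.

```python
import random


def simulate_cost(optimistic, most_likely, pessimistic, runs=10_000, seed=42):
    """Draw cost samples from a triangular distribution (an assumed model)."""
    rng = random.Random(seed)
    return [rng.triangular(optimistic, pessimistic, most_likely) for _ in range(runs)]


def probability_of_exceeding(samples, limit):
    """Fraction of samples that breach the given limit."""
    return sum(s > limit for s in samples) / len(samples)


# Hypothetical alternatives: (optimistic, most likely, pessimistic) cost.
alternatives = {
    "build in-house": (80, 100, 200),
    "buy and integrate": (100, 120, 150),
}
BUDGET_LIMIT = 150  # hypothetical cost constraint, same units as above

exceedance = {}
for name, (low, mode, high) in alternatives.items():
    samples = simulate_cost(low, mode, high)
    exceedance[name] = probability_of_exceeding(samples, BUDGET_LIMIT)
    print(f"{name}: P(cost > {BUDGET_LIMIT}) = {exceedance[name]:.3f}")

# Select the alternative least likely to breach the constraint.
preferred = min(exceedance, key=exceedance.get)
print("Preferred on this measure:", preferred)
```

Comparing every alternative on the same measure, here the probability of exceeding one shared limit, is what makes the comparison fair even though the underlying distributions have different shapes. A real analysis would cover several performance measures and the stakeholders’ risk tolerance for each.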

5. Continuous Risk Management

NASA applies Continuous Risk Management (CRM) to its overall Risk Management Strategy. The relationship between CRM and RIDM is complex, and we recommend referring to the handbook for an in-depth discussion.

For this section, suffice it to say that Risk-Informed Decision-Making will help select the alternative. With that alternative comes a set of risks that need to be managed and performance objectives that need to be achieved; these risks and goals will form the input into CRM.

The CRM process consists of six stages:

Identify — In this stage, a set of risks is identified. An individual risk is defined in the handbook by four components:
- Condition — A condition describes the circumstances that would lead to the realisation of the risk scenario.
- Departure — A departure refers to a specific change in the baseline project plan that is facilitated by a particular condition.
- Asset — An asset represents a project resource adversely impacted by the risk.
- Consequence — The consequence lists the unfavourable impacts on the organisation’s capabilities to meet its objectives.
Analyze — During this phase, the likelihood of the risk, its potential impact, and any uncertainties involved are estimated. The analysis also identifies the stage of the project at which the risk first emerged and catalogues any new threats.
Plan — During the planning phase, stakeholders determine the actions to take for each risk. The handbook lists six possible options:
- Accept — The risk is deemed acceptable, and nothing further is done.
- Mitigate — Actions are taken to influence the risk drivers.
- Watch — Data is collected on the risk drivers.
- Research — A typical recourse when significant uncertainties are involved that stakeholders believe can be reduced through further investigation.
- Elevate — A form of escalation to higher echelons when the risk cannot be efficiently managed within the present unit.
- Close — A risk can be closed when the risk drivers become sufficiently weak and do not pose any more threats.
Track — Tracking occurs to assess the efficiency of mitigation plans or look out for new risks.
Control — Action is taken during this step if a mitigation plan fails to address the risk drivers.
Document and Communicate — Proper documentation facilitates knowledge sharing, management, and communication with peers, other organisational units, or upper management.
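The four components of an individual risk and the response options listed under Plan map naturally onto a small record type. The sketch below is an assumption about how a team might capture them in code; the class names, the sentence template in `summary`, and the sample risk are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Disposition(Enum):
    """Response options chosen during the Plan stage."""
    ACCEPT = "accept"
    MITIGATE = "mitigate"
    WATCH = "watch"
    RESEARCH = "research"
    ELEVATE = "elevate"
    CLOSE = "close"


@dataclass
class RiskStatement:
    condition: str
    departure: str
    asset: str
    consequence: str
    disposition: Optional[Disposition] = None  # unset until the Plan stage

    def summary(self) -> str:
        status = self.disposition.value if self.disposition else "unplanned"
        return (
            f"Given that {self.condition}, there is a possibility of "
            f"{self.departure} affecting {self.asset}, which may result in "
            f"{self.consequence} [{status}]"
        )


# Hypothetical risk recorded during Identify, then dispositioned during Plan.
risk = RiskStatement(
    condition="the identity provider API is still in beta",
    departure="a two-sprint delay to the login feature",
    asset="the customer portal release",
    consequence="a missed launch date",
)
risk.disposition = Disposition.MITIGATE
print(risk.summary())
```

Storing each component in its own field keeps the statement reviewable during Track and Control, since a change in the condition or the disposition is visible on its own.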

6. Why Is Risk Management Difficult

The following paragraph summarises NASA’s position on risk management:

The RIDM process acknowledges human judgment’s role in decisions and that technical information cannot be the sole basis for decision-making. This is not only because of inevitable gaps in the technical information but also because decision-making is an inherently subjective, values-based enterprise. In the face of complex decision-making involving multiple competing objectives, the cumulative judgment provided by experienced personnel is essential for effectively integrating technical and non-technical factors to produce sound decisions.

— NASA Risk-Informed Decision-Making Handbook

The above passage clearly emphasizes the subjective nature of human judgment in any decision-making process.

It also underlines the importance of embracing values within the organization and their influence on the final decisions so that technical considerations are not solely responsible for selecting the preferred solution.

The passage also acknowledges the presence of “inevitable gaps in the technical information”, that is, uncertainty in the inputs, which leads to probability distributions and likelihoods of outcomes rather than firm values.

Risk is usually defined in statistical terms; it is nothing more than a set of possible future scenarios, their likelihood, and their impact on the performance objectives should they materialise.

In contrast, uncertainty applies to situations where the outcomes of a system unfolding and its causal structures are not fully understood (see the article on complexity and complex systems and Fooled by Randomness by Nassim Taleb for more insights). In such cases, risk cannot be modelled using traditional statistical formulae.

The scepticism surrounding risk management is based on the premise that risk and uncertainty are not distinguished in risk management practices. This confusion manifests itself through the following (incorrect) assumptions:

Real-world risk is Gaussian. In reality, it is often fat-tailed.
Risk probability distributions are static and known a priori. The reality is that risk emerges with time and is hidden.
Threats are exogenous (external to the system or organisation). The truth is that they are also endogenous (from within the system, created by its complex forces).

While some risks certainly follow a normal distribution and can be determined at the start of the project, the ones that make or break a project are probably fat-tailed, both exogenous and endogenous, and emerge over time. (History is not a gradual, incremental process but is dictated by the arrival of Black Swans.)
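To make the difference between Gaussian and fat-tailed risk concrete, the sketch below compares how quickly the probability of an extreme outcome shrinks when the threshold doubles. The Pareto distribution and its tail index of 2 are illustrative assumptions standing in for a fat-tailed risk; they are not drawn from the handbooks or from project data.

```python
from math import erfc, sqrt


def normal_survival(z):
    """P(Z > z) for a standard normal variable."""
    return 0.5 * erfc(z / sqrt(2))


def pareto_survival(x, alpha=2.0, x_min=1.0):
    """P(X > x) for a Pareto variable with tail index alpha (assumed value)."""
    return (x_min / x) ** alpha if x >= x_min else 1.0


# How much rarer does an outcome become when the threshold doubles from 2 to 4?
normal_ratio = normal_survival(4) / normal_survival(2)
pareto_ratio = pareto_survival(4) / pareto_survival(2)

print(f"Normal tail shrinks by a factor of {1 / normal_ratio:,.0f}")
print(f"Pareto tail shrinks by a factor of {1 / pareto_ratio:,.0f}")
```

Under the normal model, extreme outcomes become negligible very quickly, whereas under the power-law model they stay plausible at every scale. A plan calibrated on the first assumption will therefore systematically underestimate the events that matter most.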

7. Conclusion

We perform Risk Management unconsciously, hundreds of times every day, through built-in mechanisms that usually help us navigate difficult and complex situations using simple heuristics.

Unfortunately, judgment solely based on intuition is not enough when collaborating on large projects in complex hierarchies; a sophisticated and efficient methodology needs to be applied to succeed.

Software projects revolve mainly around integrating mature products through standardised interfaces, delivering standard functionality. The risk of things going wrong in such endeavours is much smaller than in projects where international, multidisciplinary teams collaborate on mega-projects, such as NASA’s missions.

However, a spectrum of methodologies exists between intuitive individual judgements and NASA’s Risk Management, a highly sophisticated and tailored framework.

It is up to every team or organisation to decide which level of risk management rigour best applies in their specific context. This should not be difficult once you have the necessary tools in your pocket.

The objective of this article was to provide a basic risk management toolset and offer some references for further review as needed.

8. References

NASA Systems Engineering Handbook
NASA Risk Management Handbook
NASA Risk-Informed Decision Making
Antifragile, Fooled by Randomness, The Black Swan — Nassim Taleb
