Technical Risk Management and Decision Analysis — Introduction and Fundamental Principles
I could not find a better way to start an article on Risk Management than by quoting the opening lines of Donald Lessard and Roger Miller’s 2001 paper that, briefly but lucidly, summarizes the nature of large engineering endeavours. It goes like this:
Large engineering projects are high-stakes games characterized by substantial irreversible commitments, skewed reward structures in case of success, and high probabilities of failure.
— Understanding and Managing Risk in Large Engineering Projects
This article leans heavily on three handbooks that NASA has published (see References section for links) in the past decade or two.
These handbooks thoroughly discuss Risk Management for large engineering projects, and I believe they would immensely benefit people in the software business.
While a rudimentary risk management system is sufficient in most cases, especially for small projects, it is always an asset to have adequate knowledge, depth, and rigour when treating any topic that can adversely affect the organization’s ability to deliver successful projects.
This approach focusing on a rigorous, in-depth, and scientific education in the software business is a primary tenet of Operational Excellence.
2. Project Risk
2.1 Definition of Risk
Intuitively, when considering risk, we focus on the probability of adverse events while neglecting their impact. Risk, however, is the possibility of something bad happening, compounded by its consequences.
The NASA Risk Management Handbook provides the following definition of risk in Systems Engineering.
Risk is the potential for performance shortfalls, which may be realized in the future, with respect to achieving explicitly established and stated performance requirements.
As we shall see in the next section, risk can cover four domains of performance-related objectives.
In this sense, performance measures how well we have achieved our goals (quality, technical, budgetary, schedule, requirement satisfaction). It is not strictly tied to the system’s performance (processing speed, throughput, capacity, load-bearing).
As applied in Risk Management, risk items are characterized as a set of triplets made up of the following elements:
- One or more scenarios that lead to degraded performance with respect to a specific objective.
  - Examples of degraded performance include budget overruns, delayed schedules, poor system performance, financial loss, loss to competition, reputational loss, or non-compliance with regulatory mandates.
  - Generally, a scenario is a sequence of credible events that specifies a system’s evolution from one state to another. In risk management, scenarios identify how a system can evolve from its current state to an undesirable one.
- The likelihood of those scenarios occurring. This measure can be qualitative or quantitative.
  - If historical data or statistical models exist (for example, models relating reliability, security, or capacity to independent variables such as hardware size or a particular configuration or setup), they can be used to derive a numerical measure of the likelihood, provided that conditions have not changed between the past and the present.
  - If such data is unavailable, Detailed Logic Modeling Estimation can be used. This method breaks the system down into logical parts in a top-down approach. Then, a bottom-up approach is applied to quantify the risk of every module across the different levels for every scenario or event. The evaluations collected at the component level are then integrated using mathematical or Boolean logic to evaluate the overall risk.
- The consequences or impact should any of the scenarios occur. The severity can be qualitative or quantitative. For example:
  - An online processing system can see its performance degraded by a percentage from zero to 100. A budget can be overrun by a certain amount.
  - On the other hand, the reputational damage incurred can be measured on a discrete scale of, for example, one to five.
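The Detailed Logic Modeling approach mentioned in the list above can be illustrated with a small fault-tree style calculation. The following is only a minimal sketch: the class names, the example scenario, and the probabilities are hypothetical, and it assumes that component-level events are statistically independent, which real systems often violate.

```python
from dataclasses import dataclass, field
from math import prod
from typing import List


@dataclass
class BasicEvent:
    """Leaf of the logic model: a component-level event with an estimated probability."""
    name: str
    probability: float

    def evaluate(self) -> float:
        return self.probability


@dataclass
class Gate:
    """Intermediate node that combines child events with Boolean logic ('AND' or 'OR')."""
    name: str
    kind: str
    children: List[object] = field(default_factory=list)

    def evaluate(self) -> float:
        probs = [child.evaluate() for child in self.children]
        if self.kind == "AND":
            # All children must occur (independence assumed).
            return prod(probs)
        if self.kind == "OR":
            # At least one child occurs (independence assumed).
            return 1.0 - prod(1.0 - p for p in probs)
        raise ValueError(f"unknown gate kind: {self.kind}")


# Hypothetical top-down breakdown of one scenario: the online service becomes unavailable.
outage = Gate("service outage", "OR", [
    Gate("both database nodes fail", "AND", [
        BasicEvent("primary node fails", 0.05),
        BasicEvent("replica node fails", 0.05),
    ]),
    BasicEvent("load balancer misconfigured", 0.01),
])

# Bottom-up evaluation: leaf estimates are integrated up to the top-level scenario.
print(f"P(service outage) = {outage.evaluate():.6f}")
```

The tree is built top-down, while `evaluate` works bottom-up, which mirrors the two passes described in the text.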
If any uncertainties exist, they are typically included when evaluating the likelihood of a scenario and its consequences by providing lower and upper bounds.
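As an illustration, a risk triplet with lower and upper bounds could be represented as follows. This is a sketch under stated assumptions: the `RiskTriplet` name, the example scenario, and the figures are invented for illustration, and multiplying the bounds is only a crude way of bracketing the exposure.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class RiskTriplet:
    """One risk item: a scenario, its likelihood, and its consequence."""
    scenario: str
    likelihood: Tuple[float, float]   # (lower, upper) probability of the scenario
    consequence: Tuple[float, float]  # (lower, upper) impact, here a cost in thousands

    def exposure_bounds(self) -> Tuple[float, float]:
        """Bracket the exposure by combining the lower bounds and the upper bounds."""
        lower = self.likelihood[0] * self.consequence[0]
        upper = self.likelihood[1] * self.consequence[1]
        return lower, upper


# Hypothetical example: uncertainty is carried as a range rather than a single value.
risk = RiskTriplet(
    scenario="Key integration vendor delivers late",
    likelihood=(0.10, 0.30),
    consequence=(50.0, 200.0),
)
low, high = risk.exposure_bounds()
print(f"Exposure between {low:.1f} and {high:.1f} (thousands)")
```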
2.2 Types and Sources of Risk
Four types of risk can be associated with implementation projects.
- Budgetary risk is associated with the inability to stay within the project’s fiscal constraints. There are two main scenarios in which that can happen. In the first, the estimates for solution design, development, verification, and validation were inaccurate. In the second, budgets are overrun due to the inability to meet technical constraints (novel or immature technology) or schedule constraints (unforeseen staff absence, poor processes, lack of skill, removal of key resources).
- Schedule risk can be related to inadequate planning or resource allocation, unreasonable timelines, poor time-management practices, inaccurate effort estimation processes, or a lack of project management skills.
- Technical risk has two primary sources. The first is poor production processes in the different stages of the SDLC: analysis, design, development, testing, and quality assurance. The second is technological challenges such as immature technology stacks, poor technological decisions, and a lack of specialized skills.
- Programmatic risk is associated with areas outside the project manager’s control, such as regulation or compliance updates, shifting priorities in the organization, unforeseen technological, schedule, or cost-associated risks, supplier delays, poor-quality third-party products, or any other external dependency.
In general, a Risk Management process will attempt to respond to some of the primary issues that have, historically, derailed IT programs. In addition to what has already been mentioned above, we summarize these issues as follows:
- The mismatch between stakeholder expectations and the actual effort and resources required to implement the project and address the risks involved
- The underestimation of risks involved and the overpromising by the decision-makers to the stakeholders
- The miscommunication of risks associated with different decisions or alternatives
3. Risk Management Strategy
3.1 What Is Risk Management
Lessard and Miller’s paper stated that “successful projects are shaped with risk resolution in mind” and “successful sponsors[…] embark on shaping efforts to influence risk drivers.”
Leaning on the statement above, we can think of risk management as:
- Recognizing the interaction between external risk drivers and internal managerial choices
- Identifying leverage points in a project and the subsequent application of effort on these leverage points to influence risk drivers.
The objective of risk management is not the total elimination of risk but the identification of viable projects where future success and revenue generation are possible.
3.2 General Considerations
The following are the primary considerations when designing a sound risk management strategy for implementing complex software solutions.
- The solution design and the concepts it is based upon are developed in sufficient detail to enable a correct assessment of the risk involved.
- Subsystems of the solution can be profiled, and those deemed riskier can benefit from the additional design, verification, and validation efforts (simulations, analysis, proof of concepts).
- A threat (and opportunity) list is maintained throughout the project, and the cost associated with the aggregate risk is calculated. Risk entries are monitored, and their impact, likelihood, and cost are updated periodically. A mitigation plan can be implemented to deal with the risks. Risks are prioritized, and the highest priorities can claim easier access to the project reserves.
- Identifying the risk tolerance levels for every performance measure is crucial for all stakeholders. Recall that a performance measure is a value that tells us how close or far we are from the best possible position we have for a specific objective. The risk tolerance will allow us to set a lower boundary for each goal. It also allows a fair comparison between alternative solutions. An ideal solution will be insensitive to changes in risk tolerance levels.
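The threat list described in the considerations above can be sketched as a small risk register. The sketch below is illustrative only: the `RiskEntry` name, the entries, and the numbers are hypothetical, and summing expected costs is a simplification that ignores correlations between risks.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RiskEntry:
    """One line of the threat list, updated periodically during the project."""
    title: str
    likelihood: float   # probability between 0 and 1
    cost_impact: float  # cost incurred if the risk materializes

    @property
    def expected_cost(self) -> float:
        return self.likelihood * self.cost_impact


def aggregate_expected_cost(register: List[RiskEntry]) -> float:
    """Cost associated with the aggregate risk (simple sum of expected costs)."""
    return sum(entry.expected_cost for entry in register)


def prioritize(register: List[RiskEntry]) -> List[RiskEntry]:
    """Highest expected cost first; these entries get first call on the reserves."""
    return sorted(register, key=lambda entry: entry.expected_cost, reverse=True)


# Hypothetical register entries.
register = [
    RiskEntry("Loss of lead architect", 0.10, 250_000),
    RiskEntry("Immature messaging framework", 0.40, 100_000),
    RiskEntry("Regulatory update mid-project", 0.05, 80_000),
]

for entry in prioritize(register):
    print(f"{entry.title}: {entry.expected_cost:,.0f}")
print(f"Aggregate expected cost: {aggregate_expected_cost(register):,.0f}")
```

Comparing the aggregate figure against the project reserves gives a first, rough indication of whether the reserves are sized sensibly.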
3.3 A Successful Risk Strategy
According to the NASA Risk Management Handbook, a successful risk management strategy must be based on the following tenets:
- Qualitative assessments must make way for quantitative analyses where applicable and practical. This shift must apply to all levels of risk management, from individual risks to aggregate ones and from handling the impact of a single adverse event to managing its primary sources and drivers.
- The risk management processes should be systematic, well-developed, and integrated into a formal risk analysis framework. The outcomes of these processes should be communicated efficiently to decision-makers to make well-informed decisions.
- The risk management strategy and the decision-making process must involve the right stakeholders and consider their risk tolerance levels when considering multiple alternatives.
- The NASA handbook acknowledges the ease with which decisions can be made by individuals working solo due to highly-developed decision-making methods. On the other hand, challenges quickly arise inside complex organisational hierarchies. Therefore, any risk management strategy should consider the complexity of an organization’s structure, organisational culture, and key stakeholder policies and plans.
3.4 When Is Risk Management Applied
NASA uses the Systems Engineering Engine to execute its mission projects, an “engine” well-suited for complex engineering endeavours. The diagram below shows a simplified version of that engine (ref. NASA Risk Management Handbook Figure 1).
Technical risk management falls under Control; the stakeholder requirements, interfaces, data, and configuration are locked and monitored for change. If an update needs to be applied, it goes through a change management process, and risk is reassessed.
However, in software initiatives, we use Agile or Waterfall-based delivery methodologies combined with Software Development Lifecycle (SDLC) concepts, neither of which discusses risk or prescribes a risk management strategy.
That is not to say that no risk management occurs in software projects. Risk management is often done at the project or program management level, without much interaction with the technical stakeholders and without a formal and rigorous framework.
Only when something goes wrong that directly impacts the project’s timeline or budget is a flag raised, usually for the first time. In my 15+ years of experience in the industry, I have never participated in, or been aware of, a risk identification session.
So how is risk management applied in software projects? Can any specific and fundamental principles or best practices be successfully implemented?
If you think closely about Agile, you will find it has a unique attitude towards change. In effect, Agile’s second principle goes like this:
Welcome changing requirements, even late in development. Agile processes harness change for the customer’s competitive advantage.
— Agile Manifesto
Agile practitioners are required to welcome change, even when it’s late in the process; there is no mention of risk, risk management, or mitigation plans.
The issue is that Agile is not always applicable, and even if you are applying Agile, it is not wise to assume all risks. This statement is even truer when you have large programs with multiple streams and coordinated activities.
The interdependencies between the different streams can be very strong for such projects, making them highly sensitive to potential threats. In such cases, an adverse event in one stream may soon cascade upstream, downstream, or sideways with devastating effects.
Change must be controlled and monitored in megaprojects (multiple streams, complex solutions, global presence, and multi-cultural teams), and risk management must be applied to prevent failures.
A risk management framework needs to be implemented within the SDLC and project management space to be effective.
Although the implementation will undoubtedly be context-dependent, we believe the ideas discussed in this article remain valuable.
3.5 The Iron Triangle
The Iron Triangle is a great way to visualize the interactions between three influential risk drivers in any program: cost, schedule, and technical.
Two main ideas govern these interactions:
- Programmatic constraints such as limited budgets, minimal time-to-market, and high-quality standards restrict the range in which cost, schedule, and technical risk can maneuver.
- Any attempt to reduce the risk in one of the three areas will increase the risk in at least one of the rest.
  - Budget cuts usually result in testing and quality assurance sacrifices, which might induce safety and quality problems.
  - Tight timeframes may require extra resources leading to increased costs. It might also mean cutting down on enabling and supporting activities such as solution design, testing, and quality assurance.
Program managers, system engineers, and risk analysts work together to find an optimal distribution of risk among the three categories that leads to a viable solution.
3.6 Avoiding Unnecessary Risks
To illustrate these concepts further, let’s try and marry them with a few of the Operational Excellence principles we advocate for software development.
Principle 2 discusses process management and governance, without which the project will suffer inevitable losses in all three spaces: cost, schedule, and quality.
Lean, efficient, and effective production processes mean larger profit margins, more room for continuous improvement, a competitive edge, increased value proposition, and higher quality products.
Superior processes and great talent allow you to achieve the best quality for the lowest budgets and in the shortest timeframes.
Principle 3 stresses the value of quality deliveries, while principle 5 discusses the vital importance of focusing on business value. The central theme here is that there is a limit, hard to determine as it may be, beyond which the project is not viable.
Accepting non-viable projects is risky for your business. It diverts valuable resources to non-profitable projects, distracting your team, preventing them from focusing on strategic initiatives, and accumulating large amounts of technical debt by accommodating unreasonable timeframes.
The danger arising from these scenarios is due to their highly technical nature and their long-term impact, which is neither discernible nor evident in the present.
Finally, principle 6 provides a method for curtailing the risk of novel or immature technology. Fortunately, the vibrant and diverse technology stacks and existing frameworks in today’s software world make this endeavour straightforward.
4. Risk-Informed Decision Making
The NASA Systems Engineering Handbook distinguishes between Risk Management and Risk-Informed Decision-Making (RIDM). NASA’s risk management approach combines Continuous Risk Management (CRM) and Risk-Informed Decision-Making (RIDM).
CRM monitors and tracks risk throughout the project’s implementation, while RIDM raises awareness of risk when critical decisions are made during execution.
RIDM helps decision-makers choose between two competing alternatives by comparing the risks involved. The objective is to help avoid redesign and rework activities that jeopardise costs and schedules.
RIDM is invoked for key decisions such as architecture and design decisions, make-buy decisions, source selection in major procurements, and budget reallocation (allocation of reserves), which typically involve requirements-setting or rebasing of requirements.
— NASA Risk-Informed Decision-Making Handbook
During a product’s implementation lifecycle, critical decisions are typically made that require a cost/benefit analysis and balancing risks. RIDM naturally comes in under such situations. These decisions are characterised by the following:
- High Stakes — High stakes may include significant costs, an irreversible commitment of resources, safety, and business survival.
- Complexity — The future implications of the decisions are not evident and require detailed analysis.
- Uncertainty — Uncertainty in inputs may lead to uncertain outcomes, requiring risk analysis for various possibilities.
- Stakeholder Management — Stakeholder management is essential when numerous stakeholders have different preferences, risk tolerance levels, policies, and influence.
4.2 The RIDM Process
The NASA Risk-Informed Decision-Making Handbook is 256 pages long, and the following few paragraphs will not do it full justice.
We recommend reading the handbook for a comprehensive understanding, as what follows is only a high-level illustration of the process.
The Risk-Informed Decision-Making process consists of three stages:
- Stage 1: Identification of alternatives — During this stage, several alternative solutions that satisfy the performance measures associated with the performance objectives are compiled. Performance measures that fall within an acceptable range are called constraints, project cost being one example. While performance measures are being created, stakeholder analysis is carried out simultaneously and influences the allowable values of the performance measures.
- Stage 2: Risk Analysis of Alternatives — Each alternative is weighed against the anticipated performance objectives we think it will produce, considering any uncertainties that may influence the input. The outcome of this process is a probability distribution of each performance measure for each alternative. The decision must be robust to model perturbations and reasonably foreseeable information.
- Stage 3: Risk-Informed Alternative Selection — A normalized method must be established to compare multiple alternatives. This normalization is required since each option’s performance measure probability distribution can differ. A fair comparison method that gives equal chances for each alternative must be used for the process to be consistent.
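To give a feel for the risk analysis and selection stages, the sketch below simulates a probability distribution of one performance measure (cost) for two alternatives and compares them against the same constraint. It is not the handbook’s method: the triangular distribution, the alternatives, the cost figures, and the budget limit are all assumptions chosen for illustration.

```python
import random
from typing import Dict, List


def simulate_cost(rng: random.Random, low: float, high: float, mode: float,
                  n: int = 10_000) -> List[float]:
    """Sample a cost distribution from optimistic, pessimistic, and most likely estimates."""
    return [rng.triangular(low, high, mode) for _ in range(n)]


def probability_of_exceeding(samples: List[float], limit: float) -> float:
    """Share of simulated outcomes that violate the constraint."""
    return sum(sample > limit for sample in samples) / len(samples)


rng = random.Random(42)   # fixed seed so the sketch is reproducible
BUDGET_LIMIT = 1_200      # hypothetical constraint on the cost performance measure

# Hypothetical alternatives with different uncertainty in their cost estimates.
alternatives: Dict[str, List[float]] = {
    "build in-house": simulate_cost(rng, low=800, high=1_800, mode=1_000),
    "buy and integrate": simulate_cost(rng, low=950, high=1_300, mode=1_050),
}

# Every alternative is judged with the same measure and the same limit.
overrun_risk = {
    name: probability_of_exceeding(samples, BUDGET_LIMIT)
    for name, samples in alternatives.items()
}
for name, risk in overrun_risk.items():
    print(f"{name}: P(cost > {BUDGET_LIMIT}) = {risk:.2f}")

preferred = min(overrun_risk, key=overrun_risk.get)
print(f"Lower overrun risk: {preferred}")
```

Although the in-house option has the lower most likely cost, its wider spread gives it the higher chance of breaching the limit, which is the kind of insight a single-point estimate hides.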
5. Continuous Risk Management
NASA applies Continuous Risk Management (CRM) to its overall Risk Management Strategy. The relationship between CRM and RIDM is involved, and we recommend you refer to the handbook for an in-depth discussion.
For this section, suffice it to say that Risk-Informed Decision-Making helps select the alternative. With that alternative comes a set of risks that need to be managed and performance objectives that need to be achieved; these risks and goals form the input into CRM.
The CRM process consists of six stages:
- Identify — In this stage, a set of risks is identified. An individual risk is defined in the handbook by four components:
  - Condition — A condition describes the circumstances that would lead to the realization of the risk scenario.
  - Departure — A departure describes a specific change in the baseline project plan facilitated by the condition.
  - Asset — An asset represents a project resource adversely impacted by the risk.
  - Consequence — The consequence lists the unfavourable impacts on the organization’s capabilities to meet its objectives.
- Analyze — During this phase, we estimate the likelihood of the risk, its potential impact, and any uncertainties involved. It also identifies at which stage of the project the risk first emerged and catalogues any new threats.
- Plan — During the planning phase, the stakeholders decide what action to take for each risk. The handbook lists six possible options:
  - Accept — The risk is deemed acceptable, and nothing further is done.
  - Mitigate — Actions are taken to influence the risk drivers.
  - Watch — Data is collected on the risk drivers.
  - Research — A typical recourse when significant uncertainties are involved that stakeholders believe can be reduced with further investigation.
  - Elevate — A form of escalation to higher echelons when the risk cannot be efficiently managed within the present unit.
  - Close — A risk can be closed when the risk drivers become sufficiently weak and do not pose any more threats.
- Track — Tracking occurs to assess the efficiency of mitigation plans or look out for new risks.
- Control — Action is taken during this step if a mitigation plan fails to address the risk drivers.
- Document and Communicate — Proper documentation facilitates knowledge sharing, management, and communication with peers, other organisational units, or upper management.
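The risk components and planning options listed above map naturally onto a small data structure. The following is a minimal sketch, not a format prescribed by the handbook: the class names, the `is_open` rule, and the example risk are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Disposition(Enum):
    """The planning options listed above."""
    ACCEPT = "accept"
    MITIGATE = "mitigate"
    WATCH = "watch"
    RESEARCH = "research"
    ELEVATE = "elevate"
    CLOSE = "close"


@dataclass
class RiskStatement:
    """An individual risk described by its four components."""
    condition: str
    departure: str
    asset: str
    consequence: str
    disposition: Optional[Disposition] = None  # decided during the Plan stage

    def is_open(self) -> bool:
        """Assumed rule: a risk stays open until it is explicitly closed."""
        return self.disposition is not Disposition.CLOSE


# Hypothetical risk captured during the Identify stage.
risk = RiskStatement(
    condition="The selected message broker has no production track record in the organization",
    departure="Integration testing takes longer than planned",
    asset="Project schedule",
    consequence="The release date slips and dependent streams are delayed",
)

# Plan stage: the stakeholders choose a disposition.
risk.disposition = Disposition.MITIGATE
print(risk.disposition.value, risk.is_open())
```

Keeping the four components as separate fields forces each risk to be stated completely, which makes the later Track and Document stages easier to support.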
6. Why Is Risk Management Difficult
The below paragraph summarizes NASA’s position on risk management:
The RIDM process acknowledges human judgment’s role in decisions and that technical information cannot be the sole basis for decision-making. This is not only because of inevitable gaps in the technical information but also because decision-making is an inherently subjective, values-based enterprise. In the face of complex decision-making involving multiple competing objectives, the cumulative judgment provided by experienced personnel is essential for effectively integrating technical and non-technical factors to produce sound decisions.
The above passage clearly emphasizes the subjective nature of human judgment in any decision-making process.
It also underlines the importance of embracing values within the organization and their influence on the final decisions so that technical considerations are not solely responsible for selecting the preferred solution.
The passage also acknowledges the presence of “inevitable gaps in the technical information”, or uncertainty in the inputs, leading to probability distributions and likelihoods of outcomes rather than firm values.
Risk is usually defined in statistical terms; it is nothing more than a set of possible future scenarios, their likelihood, and their impact on the performance objectives should they materialize.
In contrast, uncertainty applies to situations where the outcomes of a system unfolding and its causal structures are not fully understood (see article on complexity and complex systems and Fooled by Randomness by Nassim Taleb for more insights). In such cases, risk cannot be modelled using traditional statistical formulae.
The scepticism surrounding risk management is based on the premise that risk and uncertainty are not distinguished in risk management practices. This confusion manifests itself through the following (incorrect) assumptions:
- Real-world risk is Gaussian. In reality, it is often fat-tailed.
- Risk probability distributions are static and known a priori. The reality is that risk emerges with time and is hidden.
- Threats are exogenous (external to the system or organization). The truth is that they are also endogenous (from within the system, created by its complex forces).
While some risk certainly follows a normal distribution and can be determined at the start of the project, the ones that make or break a project are probably fat-tailed, exogenous and endogenous and emerge over time. (History is not a gradual, incremental process but is dictated by the arrival of Black Swans).
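To make the contrast between Gaussian and fat-tailed risk more tangible, the sketch below compares how quickly the probability of an extreme outcome decays under each model. The Pareto distribution and its parameters are assumptions picked purely for illustration; the two distributions are not matched in mean or variance, so only the rate of decay is meaningful here.

```python
import math


def normal_tail(k: float) -> float:
    """P(Z > k) for a standard normal variable Z."""
    return 0.5 * math.erfc(k / math.sqrt(2))


def pareto_tail(x: float, alpha: float = 2.0, x_min: float = 1.0) -> float:
    """P(X > x) for a Pareto variable with shape alpha and scale x_min."""
    return 1.0 if x < x_min else (x_min / x) ** alpha


# The Gaussian tail collapses almost immediately; the Pareto tail shrinks slowly.
for threshold in (2, 5, 10):
    print(
        f"threshold {threshold:>2}: "
        f"normal {normal_tail(threshold):.2e}, Pareto {pareto_tail(threshold):.2e}"
    )
```

Under the Gaussian assumption, an outcome ten units out is treated as practically impossible, whereas the fat-tailed model still assigns it a probability that a project plan cannot ignore.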
We perform Risk Management unconsciously, hundreds of times every day, via built-in mechanisms that usually help us get by in difficult and complex situations using simple heuristics.
Unfortunately, judgement solely based on intuition is not enough when collaborating on large projects in complex hierarchies; a sophisticated and efficient methodology needs to be applied to succeed.
Software projects revolve mainly around integrating mature products through standardized interfaces, delivering standard functionality. The risk of such endeavours going south is much smaller than in projects where international, multi-disciplinary teams collaborate on mega-projects like NASA’s missions.
However, a spectrum of methodologies exists between intuitive individual judgements and NASA’s highly sophisticated and tailored Risk Management framework.
It is up to every team or organization to decide which level of risk management rigour best applies in their specific context. This should not be difficult once you have the necessary tools in your pocket.
This article’s objective was to provide a basic risk management toolset and some references to look at if needed.