Resilience Principles: The Search for Optimum Combinations

December 19, 2012
Articles, CBRNE, Communication & Interoperability, Critical Infrastructure, Emergency Management, Emergency Medical Services, Fire, Hospitals, Law Enforcement, Science & Technology, Terrorism, Transportation
Scott Jackson

Most current uses of the term “resilience” in relation to engineered systems reflect the idea that a resilient system can return to a close approximation of its original function even after disruption by a threat. However, the resilience of a system depends on many other factors as well, including the outcome desired and the magnitude and type of threat involved. Moreover, different stakeholders may desire different outcomes in cases where total recovery is either not possible or not practical. When designing the system, therefore, the designer must consider many scenario-dependent factors – including the practicality and affordability of several potential solutions to various real and/or potential problems.

Resilience is often discussed in relation to infrastructure systems, with elements including but not limited to organizational factors (police and fire departments), physical factors (dams, bridges, warehouses, and office buildings), and procedural factors (fire protection protocols and law-enforcement requirements). Resilience applies to all three types of these elements – and to their integrated composites, including not only systems per se but also systems of systems (SoS). The latter are systems composed of two or more components, each established under different leadership and developed without the specific intent of interfacing with other systems.

The various constituent systems – fire protection, law enforcement, and power distribution systems, for example – have been and are separately developed and operated, so their interactions give rise to what are known as emergent properties. Those same interactions often cause what are called cascading failures. For example, the public water supply in New York City was damaged on 9/11 by debris from the World Trade Center attacks, and that damage resulted in the flooding of the New York Stock Exchange.

By modeling the infrastructure with all of the component systems, and with all of their various inputs and outputs taken into consideration, decision makers may be able to anticipate and plan for infrastructural vulnerabilities that can lead to similar cascading failures in future emergencies. At the local level, the practitioner may be able to ensure that there is adequate physical separation between such individual components as electrical cables, water pipes, and communications lines. At the higher SoS level, resilience would require more planning to ensure that each node of the infrastructure has access to multiple sources of water, electrical, and other services.
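The kind of infrastructure model described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the node names, the dependency map, and the rule that a node fails when any required service loses all of its suppliers are hypothetical and are not taken from any real infrastructure study.

```python
# Hypothetical dependency map: for each node, the services it needs and the
# alternative suppliers of each service. A node keeps working only while every
# required service still has at least one working supplier.
DEPENDENCIES = {
    "water_main_a": {},
    "water_main_b": {},
    "substation": {},
    "pump_station": {"power": ["substation"]},
    "hospital": {
        "water": ["water_main_a", "water_main_b"],  # two independent sources
        "power": ["substation"],
    },
    "exchange": {
        "water": ["water_main_a"],  # single source
        "power": ["substation"],
    },
}


def cascade(dependencies, initially_failed):
    """Return the set of all nodes that have failed once the cascade settles."""
    failed = set(initially_failed)
    changed = True
    while changed:
        changed = False
        for node, services in dependencies.items():
            if node in failed:
                continue
            # The node fails if any required service has lost every supplier.
            if any(
                all(supplier in failed for supplier in suppliers)
                for suppliers in services.values()
            ):
                failed.add(node)
                changed = True
    return failed


# Losing one water main takes down the single-source node but not the
# node that has a second, independent water source.
after_water_loss = cascade(DEPENDENCIES, {"water_main_a"})
```

In this toy map, the node with two water sources survives the loss of one main, while the node with a single source does not, which is the SoS-level argument for giving each node multiple sources of each service.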

Nondeterminism: A Few Specifics

One key characteristic of resilience is its “nondeterminism,” which means that a system’s future state of recovery cannot be quickly or easily calculated through the use of standard mathematical algorithms. The reason for this nondeterminism is the unpredictability of both the time state – i.e., exactly when a threat will strike, as well as its magnitude and type – and the physical state, including the severity, quantity, and types of damage the system suffers. Given the uncertainties of these factors, it is extremely difficult and often impossible to know in advance how the system will respond to certain types of emergencies.

Although they lack exact information, practitioners can nonetheless build models of specific system configurations and use them to explore various hypothetical threats and scenarios. So-called Monte Carlo methods, which are based on repeated random sampling, can be used to model the effect of a statistically varied distribution of threat types and magnitudes on the system and, by doing so, develop a rough statistical approximation of the anticipated effects. The results of such simulations usually help the practitioner draw at least a few reasonable conclusions about the system’s overall resilience.
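A minimal Monte Carlo sketch in Python may make the idea more concrete. Everything specific here is an assumption chosen for illustration: the function name, the exponential distribution of threat magnitudes with a mean of 3.0, and the plus or minus 20 percent variation in the system's effective capacity.

```python
import random


def simulate_survival_rate(trials=10_000, capacity=6.0, seed=42):
    """Estimate the fraction of randomly sampled threat events a system withstands.

    Hypothetical assumptions: threat magnitude is exponentially distributed
    with mean 3.0, and the effective capacity on any given day varies
    uniformly within +/-20% of the design capacity (aging, maintenance).
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    survived = 0
    for _ in range(trials):
        magnitude = rng.expovariate(1 / 3.0)  # rate = 1 / mean
        effective_capacity = capacity * rng.uniform(0.8, 1.2)
        if magnitude <= effective_capacity:
            survived += 1
    return survived / trials


baseline = simulate_survival_rate()
strengthened = simulate_survival_rate(capacity=12.0)
```

Comparing `baseline` with `strengthened` gives a rough statistical sense of how much a design change would help, without any claim to predict the outcome of a particular event.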

Abstract Principles & Concrete Examples

A number of seemingly abstract principles may be applied to any system in any domain, but the applications of those principles require the design of specific concrete solutions that are both domain- and scenario-dependent. As suggested above, the concrete solutions implemented can be physical, organizational, or procedural in nature – and can be modeled in enough detail to make reasonably accurate predictions about their future effectiveness. The abstract principles used typically embody the essential characteristics that will be found in any concrete solution that implements the specific principles involved. For example, the principle of physical redundancy requires at least two independent and parallel branches, so any concrete solution implementing that principle will have at least two such branches.

A paper published on 19 October 2012 in the Systems Engineering journal included a long and comprehensive list of abstract principles, gathered from various sources. The following principles are adapted from that list:

Absorption: The system is able to withstand the disruption level specified. (Example: A levee is able to withstand a 100-year-flood incident.)
Physical redundancy: The system consists of at least two identical and independent branches. (Example: San Francisco is served by three water systems.)
Functional redundancy: The system includes at least two functionally different branches. (Example: There are several ways – by car, train, aircraft, or boat, for example – to evacuate people from a coastal city.)
Layered defense: There is no single point of failure that threatens the entire system. (Example: The Los Angeles Metrolink system now has two separate layers of defense available – positive train control and cab monitoring.)
Humans in the loop: The system has enough capable people immediately available to handle unanticipated disruptions. (Example: A nuclear power plant.)
Reduced complexity: The system is characterized by “minimum complexity.” (Example: Micro-grids are being considered to reduce the growing complexity of current power grids.)
Reorganization: The system is capable of quickly restructuring itself after a major disaster/disruption. (Example: The New York City power system was restructured following the 9/11 attacks.)
Repairability: The system is capable of being repaired. (Example: The Hubble space telescope was actually repaired in orbit.)
Localized capacity: Each node of the system is capable of independent operation. (Example: Hospitals typically have independent generators to provide electrical power.)
Loose coupling: The system has flexibility between nodes to reduce the possibility of cascading failures. (Example: Power grids rely on human operators to reduce the possibility of cascading failures.)
Drift correction: The system is able to anticipate and correct for an oncoming threat or hidden flaw. (Example: Positive train control detects oncoming trains and takes whatever actions are needed to prevent collisions.)
Neutral state: The system is capable of maintaining a neutral state to deal with disruptions. (Example: A ban on “self-dispatching” would prevent first responders from entering buildings without proper authorization.)
Internode interaction: The system is able to maintain cohesion through the use of effective communications, cooperation, collaboration, and command and control operations. (Example: Following the 2005 bombings in the London subway system, survivable communications systems were installed to maintain cohesion during and after future incidents.)
Reduce hidden interactions: The system has no harmful interactions among its parts. (Example: A detailed review among sub-organizations reduces hidden and/or unforeseen interactions that might cause or lead to partial or total failure of the system.)

The Inherent Vulnerabilities of Revered Principles

The 14 principles listed above each have inherent vulnerabilities – the potential for either harm or ineffectiveness – if they are not fully and effectively implemented. This is particularly true for principles relying on human involvement. Even the two principles for which the chance of harm is relatively low – absorption and physical redundancy – possess certain vulnerabilities.

In applying the absorption principle, for example, practitioners must ensure: (a) that capability has not been degraded by aging or poor maintenance; (b) that there are no latent faults – many of which can be detected only through rigorous audits and reviews; and (c) that the system is robust enough to withstand threats over a wide variation in conditions. Similarly, the physical redundancy principle has vulnerabilities, including: (a) the possibility that, when two branches of the system are not truly independent, a failure in one branch can cause a failure in the other; (b) the likelihood that, if two software systems are identical, a hidden flaw in one system may also exist in the other; and (c) the tendency, in organizational systems, for redundant communications systems to transfer ambiguous and/or incomplete information.
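The effect of branches that are not truly independent can be illustrated with a simple probability sketch in Python. The model is an assumption introduced here, not something taken from the paper cited above: each of two branches can fail on its own with probability `p_independent`, and a shared cause (a common flaw or a common location, for instance) can take out both at once with probability `p_common`. The function name and the numbers are hypothetical.

```python
def system_failure_probability(p_independent, p_common=0.0):
    """Failure probability of a two-branch redundant system.

    Simplified model (an assumption for illustration): the system fails if a
    shared cause disables both branches, or, failing that, if both branches
    fail independently of each other.
    """
    for p in (p_independent, p_common):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie between 0 and 1")
    both_fail_independently = p_independent ** 2
    return p_common + (1.0 - p_common) * both_fail_independently


# Truly independent branches: only the simultaneous-failure term remains.
ideal = system_failure_probability(0.01)

# Same branches, plus a small chance of a shared cause hitting both.
with_shared_flaw = system_failure_probability(0.01, p_common=0.001)
```

Under these assumed numbers, the shared-cause term dominates the result, which is why a hidden flaw common to two identical branches can erase most of the benefit of redundancy.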

In many cases, two or three resilience principles must be invoked in the appropriate combinations. The specific “linked” principles depend either on the anticipated scenario problems or on the inherent vulnerabilities of the primary principle. In many major disasters, for example, communication and other functions of the internode interaction principle may not survive the threat event. As a result, the absorption, physical redundancy, and/or functional redundancy principles may be selected to ensure the survival of the system functions.

Another example is that the reduced complexity principle almost always involves restructuring the system – which means, of course, that the reorganization principle may have to be invoked. Many principles, such as the neutral state and the internode interaction principles, almost always require human intervention, so the humans in the loop principle can be logically linked to them.

Resilience – Expensive & Inexpensive Alternatives

Some resilience solutions – redundancy, for example – are expensive by nature. The building of redundant aqueducts or dams, even if technologically feasible, would undoubtedly be expensive – but under certain political and/or economic circumstances, the high cost may be justifiable.

For an inexpensive or even no-cost alternative, the most likely solution would be one that is purely procedural. Many such solutions would fall under the internode interaction principle. The least expensive solution would be removing impediments to cooperation among organizations and agencies. A major problem often encountered by emergency management organizations is the phenomenon of “self-dispatching,” exemplified by the unauthorized entry of responders (or other persons) into a burning building. This was a problem at both the World Trade Center and the Pentagon on 9/11. Solutions to this problem would be procedural and would carry a very low cost.

High-cost items become worth the price when the adverse consequences projected exceed the cost. As mentioned earlier, San Francisco built a triple-redundant water system after the 1906 earthquake. In some cases, it may be possible to perform a lifecycle cost analysis that includes the cost of the resilience enhancement actions, and then to balance that against the probable cost that would be expected if those actions were not implemented.

The challenge, of course, is that, although the cost of certain preventive actions can be determined, the cost that would have been incurred had the preventive action not been taken is necessarily indeterminate, because the likely costs, as opposed to the merely possible ones, cannot all be accurately determined. Only statistical methods can be used to support such decisions. Hence, the issue of cost is sometimes easy, sometimes difficult, and sometimes irresolvable.
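One simple statistical form of that comparison is an expected-cost calculation, sketched below in Python. The function name and all of the figures are hypothetical, and the sketch assumes a constant annual event probability, no discounting of future costs, and a single kind of event; a real lifecycle cost analysis would need to relax each of those assumptions.

```python
def expected_lifecycle_cost(upfront_cost, annual_event_probability,
                            loss_per_event, years):
    """Upfront spending plus the expected losses over the planning horizon.

    Assumes (for illustration only) a constant annual event probability
    and no discounting of future losses.
    """
    expected_losses = annual_event_probability * loss_per_event * years
    return upfront_cost + expected_losses


# Hypothetical figures, in millions of dollars, over a 50-year horizon.
do_nothing = expected_lifecycle_cost(
    upfront_cost=0.0, annual_event_probability=0.01,
    loss_per_event=500.0, years=50,
)
with_redundancy = expected_lifecycle_cost(
    upfront_cost=120.0, annual_event_probability=0.01,
    loss_per_event=50.0, years=50,
)

# The enhancement is worth its price, under these assumptions, when the
# expected lifecycle cost with it is lower than the cost without it.
worth_it = with_redundancy < do_nothing
```

The result is only as good as the assumed probability and loss figures, which is exactly the indeterminacy described above; varying those inputs over a plausible range shows how sensitive the decision is to them.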

Significant contributions to this article were made by Timothy L.J. Ferris, who holds a Ph.D. from the University of South Australia – where, as Associate Director Teaching and Learning in the Defence and Systems Institute, he has responsibility for all of the Institute’s teaching programs. Dr. Ferris also is supported by the International Council on Systems Engineering (INCOSE) as lead author in the Curriculum for Systems Engineering (BKCASE) Project and is the INCOSE Associate Director for Academic Research. He oversees research in the resilience of engineered systems and, with Scott Jackson, is a peer-reviewed author on that subject.

Scott Jackson

Scott Jackson is a lecturer in the Systems Architecting and Engineering program of the University of Southern California (USC) and the author of Architecting Resilient Systems: Accident Avoidance and Survival and Recovery from Disruptions (published in 2010 by John Wiley & Sons, Hoboken, N.J.). He also: (a) is a fellow of the International Council on Systems Engineering (INCOSE) and chair of the INCOSE Resilient Systems Working Group; and (b) represents both INCOSE and USC on The Infrastructure Security Partnership (TISP). Jackson holds an MS in Engineering from the University of California in Los Angeles and is a Ph.D. candidate at the University of South Australia.