Robustness
- Robustness: the defensibility quality factor representing the degree to which essential
mission-critical services continue to be provided in spite of potentially harm-causing events or conditions.
As illustrated in the preceding figure, Robustness is part of the following inheritance hierarchy:
- Type: Abstract
- Superclass: Defensibility
- Subclasses:
  - Business Robustness, which is the degree to which a business enterprise continues to function properly under abnormal conditions or circumstances.
  - System Robustness, which is the degree to which a system continues to function properly under abnormal conditions or circumstances.
  - Application Robustness, which is the degree to which an application continues to function properly under abnormal conditions or circumstances.
  - Hardware Robustness, which is the degree to which a hardware component continues to function properly under abnormal conditions or circumstances.
  - Software Robustness, which is the degree to which a software component continues to function properly under abnormal conditions or circumstances.
The typical responsibilities of Robustness are to:
- Model the degree to which graceful degradation occurs.
- Measure the degree to which stakeholders can depend on the continuation of essential mission-critical
services in spite of potentially harm-causing events or conditions.
- Support the analysis and specification of robustness requirements.
- Provide a foundation for evaluating the quality of an architecture.
As illustrated in the following graphic, robustness can be
decomposed into the following aggregation hierarchy of subfactors:
- Environmental Tolerance, which is the degree to which essential mission-critical services continue to be provided in spite of potentially harm-causing environmental conditions (e.g., salt spray causing corrosion or radiation randomly changing the value of a bit within memory).
- Error Tolerance, which is the degree to which essential mission-critical services continue to be provided in spite of the presence of erroneous input (e.g., incorrect, stale, or out-of-order data). Note that erroneous input is typically due to human error, although it may also be due to sensor failure, timing delays, etc.
- Fault Tolerance, which is the degree to which essential mission-critical services continue to be provided in spite of the presence or execution of defects, whereby a defect (also known as a fault or bug) is an underlying flaw in a work product (i.e., a work product that is inconsistent with its requirements, policies, goals, or the reasonable expectations of its customers or users). Note that a defect may or may not cause a failure, depending on whether the defect is executed and whether exception handling prevents the failure from occurring.
- Failure Tolerance, which is the degree to which essential mission-critical services continue to be provided in spite of the occurrence of failures, whereby a failure is the execution of a defect that causes an inconsistency between an executable work product’s actual (i.e., observed) and expected (e.g., specified) behavior.
  - Fail Safety is the degree to which something places itself into a safe operating mode in the event of specific failures.
  - Fail Security is the degree to which something places itself into a secure operating mode in the event of specific failures.
  - Fail Softness is the degree to which something continues to provide partial operational capabilities (possibly in a degraded mode) in the event of specific failures.
- Safety Incident Tolerance, which is the degree to which essential mission-critical services continue to be provided in spite of the occurrence of safety incidents (e.g., accidents and near misses).
- Security Incident Tolerance, which is the degree to which essential mission-critical services continue to be provided in spite of the occurrence of security incidents (e.g., security attacks and probes).
- Survivability Incident Tolerance, which is the degree to which essential mission-critical services continue to be provided in spite of the occurrence of survivability incidents (e.g., military attacks).
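As an illustration of error tolerance, the following Python sketch shows one way a component might keep providing a value in spite of incorrect, stale, or out-of-order input. It is a minimal example under stated assumptions, not a prescribed design: the `Reading` fields, the `ErrorTolerantInput` class, and the range, age, and sequence checks are all hypothetical names and policies chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Reading:
    """A hypothetical input record; the field names are assumptions."""
    sequence: int      # sequence number assigned by the sender
    timestamp: float   # time the reading was taken, in seconds
    value: float       # the measured value


class ErrorTolerantInput:
    """Rejects erroneous readings and keeps serving the last good value."""

    def __init__(self, low: float, high: float, max_age: float) -> None:
        self.low = low            # smallest plausible value
        self.high = high          # largest plausible value
        self.max_age = max_age    # oldest acceptable reading, in seconds
        self.last_good: Optional[Reading] = None

    def accept(self, reading: Reading, now: float) -> bool:
        """Return True if the reading was accepted, False if it was rejected."""
        if not (self.low <= reading.value <= self.high):
            return False  # incorrect: outside the plausible range (also rejects NaN)
        if now - reading.timestamp > self.max_age:
            return False  # stale: the reading is too old to be trusted
        if self.last_good is not None and reading.sequence <= self.last_good.sequence:
            return False  # out of order: not newer than the last accepted reading
        self.last_good = reading
        return True

    def current_value(self) -> Optional[float]:
        """Return the last accepted value, or None if nothing has been accepted."""
        return None if self.last_good is None else self.last_good.value
```

Rejected readings leave `current_value()` unchanged, so the service continues on the last accepted input. A real system would also need a policy for how long the last good value may be reused.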
Robustness is typically measured in terms of the:
- Operational availability of one or more specific essential mission-critical services
(e.g., functions/features/use cases/use case paths)
while suffering from a specific [level of] harm due to an accident or attack
- Mean or maximum operational availability for a specific category of essential mission-critical services
while suffering from a specific [level of] harm due to an accident or attack
- Mean loss time (i.e., mission-duration * (1 - operational-availability))
of one or more specific essential mission-critical services
as a result of specific harm due to an accident or attack
- Percentage of losses of essential mission-critical services that are detected and reported
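The loss-time formula in the list above can be worked through with a short Python sketch. The function name, the argument validation, and the example figures (a 720-hour mission at 0.995 operational availability) are assumptions made for illustration only.

```python
def mean_loss_time(mission_duration: float, operational_availability: float) -> float:
    """Compute mission-duration * (1 - operational-availability).

    The result is in the same time unit as mission_duration.
    """
    if mission_duration < 0:
        raise ValueError("mission_duration must be non-negative")
    if not 0.0 <= operational_availability <= 1.0:
        raise ValueError("operational_availability must be between 0 and 1")
    return mission_duration * (1.0 - operational_availability)


# Hypothetical example: a 720-hour mission with 0.995 operational availability
# loses roughly 3.6 hours of the essential service.
loss_hours = mean_loss_time(720.0, 0.995)
```

Because the calculation uses floating-point arithmetic, the result should be compared against a tolerance, not tested for exact equality.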
Typical mechanisms for implementing robustness include:
- Assertions:
  - Preconditions
  - Postconditions
  - Invariants
- Back-up of application state and data
- Communications re-routing
- Degraded modes of operation
- Disaster avoidance and recovery
- Exception handling
- Fault-tolerant architectures:
  - Dual-channel architecture, which provides a standby or backup channel in case the primary channel fails:
    - Homogeneous dual-channel architecture, which uses identical channels.
    - Heterogeneous dual-channel architecture, which uses different channels (e.g., different designs or different implementations).
  - Multi-channel voting architecture, which includes at least 3 simultaneously executing channels with a “voter” that:
    - Compares the outputs from the different channels.
    - Delivers the majority output.
    - Fail-stops any minority channels.
  - Dual-dual architecture, which consists of two pairs of dual-channels:
    - A primary dual-channel architecture.
    - A secondary standby dual-channel architecture.
  - Monitor-actuator architecture, which consists of:
    - A primary channel.
    - A monitor channel.
    - A secondary backup shutdown channel.
- Hot and cold failover to other preestablished alternative systems, applications, or sites
- Graceful degradation
- Monitoring:
  - Actuation monitoring, which monitors the system’s control of its actuators:
    - End-around monitoring, which checks the correctness or reasonableness of the system’s signals to its actuators.
    - Wrap-around monitoring, which checks the correctness or reasonableness of the behavior of the system’s actuators.
  - Actuation-results monitoring, which uses one or more independent sensors to verify that the system actually achieves its intended results.
  - Shutdown monitoring, which monitors for the internal results of faults that can cause externally visible failures so that the system can be properly (e.g., safely) shut down.
- Redundancy:
  - Hardware (e.g., actuator, network, sensor, server, storage) redundancy
  - Functional redundancy (multiple channels)
  - Software redundancy
- Rollback to previously valid state
- Training
- Warning signs (visual and auditory)
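The multi-channel voting architecture listed above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the class name `MajorityVoter`, the representation of channels as plain callables executed one after another (not truly simultaneously), and the decision to fail-stop a channel that raises an exception are all hypothetical choices that the text above does not specify.

```python
from collections import Counter
from typing import Callable, Dict, List, Sequence

Channel = Callable[[int], int]  # hypothetical channel signature


class MajorityVoter:
    """Compares channel outputs, delivers the majority, fail-stops the minority."""

    def __init__(self, channels: Sequence[Channel]) -> None:
        if len(channels) < 3:
            raise ValueError("a voting architecture needs at least 3 channels")
        self.channels: List[Channel] = list(channels)
        self.active: List[bool] = [True] * len(self.channels)

    def vote(self, x: int) -> int:
        # Collect one output from every channel that is still active.
        outputs: Dict[int, int] = {}
        for index, channel in enumerate(self.channels):
            if not self.active[index]:
                continue
            try:
                outputs[index] = channel(x)
            except Exception:
                self.active[index] = False  # assumption: a crashing channel is fail-stopped
        if not outputs:
            raise RuntimeError("no active channels remain")

        # Compare the outputs and find the most common one.
        winner, count = Counter(outputs.values()).most_common(1)[0]
        if count * 2 <= len(outputs):
            raise RuntimeError("no majority among the active channels")

        # Fail-stop every channel that disagreed with the majority.
        for index, output in outputs.items():
            if output != winner:
                self.active[index] = False
        return winner
```

For example, with two channels computing `x * 2` and a third, faulty channel computing `x * 2 + 1`, `vote(3)` returns `6` and marks the third channel inactive. Later calls are decided by the two remaining channels.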
The following guidelines have been found to be useful when producing robustness requirements:
- Robustness is related to reliability and operational
availability because an application or component cannot be
reliable or available if it is not robust.
- Robustness is often critical for business-critical and safety-critical applications.
- Robustness involves the resistance to, detection of,
and recovery from the loss of essential services.
- Robustness ensures that service
either fails gracefully or else continues to be provided
(possibly in a degraded mode), even though certain components
have been intentionally damaged or destroyed.
- Avoid confusing
physical protection with continuity. Continuity deals with
continued functioning after an attack, whereas
physical protection requirements deal with the
protection of components. Physical protection is
typically a prerequisite for continuity.
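The guideline that service should either fail gracefully or continue in a degraded mode can be illustrated with a small Python sketch. The names `Mode` and `provide_service`, and the idea of passing the primary and degraded behaviors as callables, are assumptions made for this example only.

```python
from enum import Enum
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


class Mode(Enum):
    NORMAL = "normal"
    DEGRADED = "degraded"


def provide_service(primary: Callable[[], T], degraded: Callable[[], T]) -> Tuple[Mode, T]:
    """Try the full service first and fall back to a degraded one if it fails.

    The returned mode lets the caller detect and report the loss of the
    primary service, not silently mask it.
    """
    try:
        return Mode.NORMAL, primary()
    except Exception:
        # Fail soft: keep providing a reduced capability.
        return Mode.DEGRADED, degraded()
```

A caller might pass a live lookup as `primary` and a cached or default answer as `degraded`. If the degraded path also raises, the exception propagates, which a real system would need to handle explicitly (e.g., by entering a safe shutdown).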