Stability
- Stability
- the subtype of the reliability quality factor
representing the degree to which mission-critical
services continue to be delivered during a given time period
under a given operational profile regardless of any failures, whereby:
- Failures limiting the delivery of mission-critical services
occur at unpredictable times
- Root causes of such failures are difficult to identify efficiently
As illustrated in the preceding figure, Stability is part of the following inheritance hierarchy:
- Type: Concrete
- Superclass: Reliability
- Subclasses:
- Hardware Stability
- Software Stability
The typical responsibilities of Stability are to:
- Measure the degree to which stakeholders can depend on an application or component
to continue to provide essential mission-critical services in spite of failures
- Support the analysis and specification of
stability requirements.
- Provide a foundation for evaluating the quality of an architecture.
Stability is typically decomposed into the following aggregation hierarchy of subfactors:
- Stability Protection
- Stability Loss Detection
- Stability Loss Reaction
Stability is typically measured in terms of the:
- Mean Essential Service Lost Time (MESLT)
for a given Essential Service (ESi) and Operational Profile (OPj),
whereby MESLT is defined as the Required Essential Service Duration (RESD) times Unavailability,
whereby Unavailability (i.e., 1 - Availability) is the ratio of the
Mean Time To Repair (MTTR) from loss of the essential service to the sum of the
MTTR and the Mean Time To Failure (MTTF) for failures causing loss of the essential service. Therefore,
MESLT(ESi,OPj) =
RESD(ESi,OPj) * (MTTR(ESi,OPj)/(MTTF(ESi,OPj) + MTTR(ESi,OPj)))
- Mean Time between Major Failures (MTMF) for a given Operational Profile (OPj),
whereby MTMF is defined as the mean period of time that the application continues to
provide essential mission-critical services under stated conditions
- Maximum permitted number of major failures per unit time under stated conditions
for a given operational profile
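The MESLT formula above can be illustrated with a short calculation. The following is a minimal sketch only: the function names and the numeric figures (a 1000-hour required duration, a 2-hour MTTR, and a 498-hour MTTF) are hypothetical values chosen for illustration and do not come from any real system.

```python
def unavailability(mttr: float, mttf: float) -> float:
    """Unavailability = MTTR / (MTTF + MTTR), i.e., 1 - Availability."""
    if mttr < 0 or mttf < 0 or (mttr + mttf) == 0:
        raise ValueError("MTTR and MTTF must be non-negative and not both zero")
    return mttr / (mttf + mttr)


def meslt(resd: float, mttr: float, mttf: float) -> float:
    """Mean Essential Service Lost Time for one essential service and
    one operational profile: RESD * Unavailability."""
    return resd * unavailability(mttr, mttf)


# Hypothetical figures for one essential service under one operational profile.
resd_hours = 1000.0   # Required Essential Service Duration (assumed)
mttr_hours = 2.0      # Mean Time To Repair (assumed)
mttf_hours = 498.0    # Mean Time To Failure (assumed)

lost_hours = meslt(resd_hours, mttr_hours, mttf_hours)
print(f"Expected essential service lost time: {lost_hours:.2f} hours")
```

With these assumed figures, Unavailability is 2/500, so roughly 4 of the 1000 required hours would be expected to be lost. All three inputs must use the same time unit.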
Typical mechanisms for achieving Stability include:
- Graceful degradation:
- Essential service-based degraded modes of operation
- Service prioritization
- Mechanisms to limit failure propagation:
- Redundancy
- Isolation middleware
- Time and space partitioning and isolation
- Pattern based:
- Communications protocols
- Component roles (e.g. to discourage circularities and to support defensive programming)
- Policies for stale or missing data
- Deterministic execution (e.g., cyclic executive or real-time operating system)
- Coding standards:
- Defensive programming
- Exception handling
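As an illustration of graceful degradation combined with service prioritization, the following minimal sketch selects which services to keep running when available capacity drops. The `Service` type, the `select_services` function, the service names, and the abstract "resource units" are all hypothetical; a real system would derive its degraded modes from its own prioritized list of essential services.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Service:
    name: str
    essential: bool   # True for essential mission-critical services
    priority: int     # lower number = higher priority
    cost: int         # resource units needed to run (hypothetical measure)


def select_services(services: List[Service], capacity: int) -> List[str]:
    """Return the names of the services to keep running within capacity.

    Essential services are considered first, then the rest by priority;
    any service that no longer fits in the remaining capacity is shed.
    """
    ordered = sorted(services, key=lambda s: (not s.essential, s.priority))
    kept, used = [], 0
    for service in ordered:
        if used + service.cost <= capacity:
            kept.append(service.name)
            used += service.cost
    return kept


services = [
    Service("cabin_display", essential=False, priority=3, cost=2),
    Service("flight_control", essential=True, priority=0, cost=4),
    Service("telemetry_logging", essential=False, priority=2, cost=2),
    Service("navigation", essential=True, priority=1, cost=3),
]

# Normal capacity versus a degraded mode after a partial failure.
print(select_services(services, capacity=10))
print(select_services(services, capacity=7))
```

In this sketch, a reduced capacity of 7 units leaves only the two essential services running, which is the intent of an essential service-based degraded mode of operation.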
The following guidelines have been found to be useful with regard to Stability:
- The scope of stability can be the
system,
application, or
component
- A system is unstable to the extent that it fails to perform its mission-critical
functions. Reliability is measured in terms of mean time between failures (MTBF)
and the measurement includes all failures, even trivial ones. On the other hand,
stability is measured in terms of mean time between critical failures (MTBCF).
Thus, stability failures are big failures with major negative consequences.
For example, the failures that would lower stability of avionics systems
would be failures preventing flight testing (during development),
failures preventing the delivery of ordnance (during actual usage),
failures causing the loss of the airplane, and failures endangering
the life of the pilot.
- Other aspects of stability are the unpredictability of failures
and the difficulty of identifying their root causes. A system is less stable the more often it crashes at unpredictable times and in unexpected ways that are difficult to reproduce or fix. Stability problems are therefore more difficult to discover during normal functional testing. Classic examples of failures that lower stability (provided their impact is sufficient to affect mission success) are failures due to concurrency defects (e.g., race conditions).
Stability overlaps with another attribute, Robustness, which is the ability to operate under abnormal conditions. Stability addresses how fragile a system is in terms of mission-critical failures that are difficult to predict and repair.
- Stability only makes sense if services have been prioritized to identify
the essential mission-critical services.
- Stability only makes sense in the context of a specific operational profile because
software failures causing loss of essential services are not scattered randomly
in time but rather depend on the execution of specific code which is a function
of the current operational profile. The operational profile can be defined
as a specific distribution of use case scenarios including potentially multiple
use cases and both normal and exceptional paths through those use cases.
- High levels of stability are difficult to verify.
One can estimate the stability of an application or
component based on ‘long term’ reliability
testing prior to delivery, but the duration of the testing
is often too limited by delivery schedules to allow very accurate estimation.
- One can use statistics to estimate the stability of
an application based on the stability of its components
(as well as certain assumptions about the independence of
the failures).
- Stability is closely related to other dependability quality factors,
especially continuity: