Critical Systems: Reliability, Availability, and Safety in Software Engineering

Posted on Nov 12, 2024 in Other subjects

Introduction

Familiarization with system failure

Systems fail without an apparent reason.
Failures can cause damage.
We take steps to compensate for the lack of confidence:
Backups
Verification of states
Disaster recovery

The higher the confidence level, the higher the cost.

The greater the trust, the lower the performance.

Trust is sometimes more important than performance.

Critical Systems

When a failure may cause:

Economic losses
Physical or environmental damage
Risks to human life

The cost of a system failure is very high, or the damage cannot be recovered.

In these systems, trust is the most important requirement.

Types of Critical Systems

Safety

Failure results in injuries, risks to human life, or environmental damage.

Mission

Failure results in not achieving the objective (e.g., controlling an aircraft).

Business

Failure of the business that uses the system (e.g., the customer accounts system in a bank).

Components Subject to Failure

Hardware

Manufacturing errors
End of life
Design or specification errors

Software

Implementation errors

Human

Incorrect operation of the system

Availability and Reliability

Availability

It is the probability of a system, at a given moment, being operational and capable of providing the required services.

Reliability

It is the probability of failure-free operation during a specified time, in a given environment, for a particular purpose.

Example:

System A: Fails once a year, but each failure takes three days to restart.
System B: Fails once per month, and each failure takes 10 minutes to reboot.

Conclusions:

A is more reliable than B.
B is more available than A.
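
As a rough numerical check of the example above, the sketch below estimates availability as the fraction of a year each system is operational. The 365-day year and the function name are assumptions made for illustration only.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: a 365-day year


def availability(failures_per_year: float, downtime_hours_per_failure: float) -> float:
    """Fraction of the year during which the system is operational."""
    downtime = failures_per_year * downtime_hours_per_failure
    return 1 - downtime / HOURS_PER_YEAR


avail_a = availability(1, 3 * 24)    # System A: one failure, three days down
avail_b = availability(12, 10 / 60)  # System B: twelve failures, ten minutes each

print(f"A: {avail_a:.4f}  B: {avail_b:.4f}")  # A: 0.9918  B: 0.9998
```

System A fails twelve times less often, yet it is unavailable for 72 hours a year against roughly 2 hours for System B.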

Whether one system is reliable or not depends on the context in which it is being applied.

Example of a car accelerating to 100 km/h
Example of using an operating system

Techniques to improve the reliability of a system:

Defect avoidance: Use development techniques that avoid introducing errors.
Defect detection and removal: Use tests and validations to find and eliminate defects before the system is actually used.
Defect tolerance: Design the system so that defects that do occur during operation are handled quickly and do not cause a system failure.

Safety

Reflects the system’s ability to operate, normally or abnormally, without posing risks to people or the environment.

Systems that have this attribute as a key factor are called safety-critical systems.

Primary Safety: A malfunction of the software causes harm directly.
Secondary Safety: A malfunction of the software causes harm indirectly (e.g., software for engineering calculations).

Safety and reliability are related but should be evaluated independently.

In short, safety concerns whether the software can cause, or avoid causing, damage or harm.

Protection

The system’s ability to protect itself against accidental or deliberate intrusion.

Damage caused by an intrusion:

Interruption of service (affects availability)
Corruption of programs or data (affects reliability and safety)
Disclosure of confidential information (subsequently affects safety, availability, and reliability)

Requirements of Trust

Functional Requirements: They define the error-checking and recovery capabilities and the protection against system failures.
Non-functional Requirements: They define the required reliability and availability of the system.
Exclusion Requirements: They define the states and conditions that must not arise.

Risk-Driven Specification

The specification of critical systems should be driven by risks.

This approach has been widely used for safety-critical systems and for protection.

The goal of the specification process is to understand the risks (safety, security, etc.) faced by the system and to define requirements that reduce those risks.

Stages of Analysis Based on Risk

Risk Identification: Identify potential risks that may arise.
Analysis and Risk Rating: Assess the severity of each risk.
Decomposition of Risk: Decompose each risk to discover its potential root causes.
Assessment of Risk Reduction: Define how each risk should be eliminated or reduced when the system is designed.

[Figure: Risk-driven specification]

Risk Identification

Identify the risks faced by the critical system.

In safety-critical systems, the risks are the hazards that can lead to accidents.
In security-critical systems, the risks are potential attacks on the system.

When identifying risks, define classes of risk and place each identified risk in one of these classes, for example:

Service failure
Electrical hazards

Risks of Insulin Pump

Excessive dose of insulin (service failure)
Insufficient dose of insulin (service failure)
Power failure due to wear of the battery (electrical)
Electrical interference with other medical equipment (electrical)
Poor sensor and actuator contact (physical)
Parts of the equipment breaking off inside the body (physical)
Infection caused by the insertion of the equipment (biological)
Allergic reaction to materials or insulin (biological)

Analysis and Risk Rating

The process is related to understanding the likelihood of a risk and the potential consequences if an accident or incident occurs.

Risks can be classified as:

Intolerable: The risk must never arise or result in an accident.
As Low as Reasonably Practicable (ALARP): The probability of the risk must be minimized, given constraints such as cost and schedule.
Acceptable: The consequences of the risk are acceptable, and no extra cost should be incurred to reduce the likelihood of the hazard.
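
One way to picture the three classes is as regions of a probability and severity grid. The sketch below is only an illustration: the ordinal scales, the scoring rule, and the thresholds are assumptions and do not come from any standard.

```python
from enum import Enum


class Region(Enum):
    INTOLERABLE = "intolerable"
    ALARP = "as low as reasonably practicable"
    ACCEPTABLE = "acceptable"


# Hypothetical ordinal scales; a real project would define its own.
PROBABILITY = {"low": 1, "medium": 2, "high": 3}
SEVERITY = {"low": 1, "medium": 2, "high": 3}


def classify(probability: str, severity: str) -> Region:
    """Map a (probability, severity) estimate to a risk region.

    The thresholds below are illustrative assumptions.
    """
    score = PROBABILITY[probability] * SEVERITY[severity]
    if score >= 6:
        return Region.INTOLERABLE
    if score >= 3:
        return Region.ALARP
    return Region.ACCEPTABLE


print(classify("medium", "high").value)  # intolerable
print(classify("low", "high").value)     # as low as reasonably practicable
```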

[Figure: Risk levels]

Social Acceptability of Risks

The acceptability of a risk is determined by considerations of human, social, and political factors.

In most societies, the boundaries between the regions move upwards over time, i.e., society becomes less willing to accept risk.
For example, the costs of cleaning up pollution may be less than the costs of prevention, but this may not be socially acceptable.
Risk assessment is subjective.
Risks are identified as likely, unlikely, etc. This depends on who is doing the evaluation.

Risk Assessment

Estimates the probability and the severity of each risk.

Usually, this cannot be done precisely, so relative values such as “unlikely”, “rare”, and “very high” are used instead of exact figures.
The goal should be to exclude the risks whose occurrences are likely or those with high severity.

[Figure: Risk assessment for the insulin pump]

Decomposition of Risk

It is concerned with discovering the root causes of risks in a particular system.

The techniques were derived mainly from safety-critical systems and may be:

Inductive, bottom-up techniques: Begin with a proposed system failure and assess the hazards that could arise from that failure.
Deductive, top-down techniques: Begin with a hazard and deduce what its causes could be.

Fault Tree Analysis

It is a deductive top-down technique.

Place the risk or hazard at the root of the tree and identify the system states that could lead to this hazard.
Where appropriate, link these states with “and” or “or” conditions.
The goal should be to minimize the number of single causes of system failure.
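
A fault tree of this kind can be sketched as a small data structure. The example below is a minimal illustration, not a full analysis tool: the `Node` class, the helper functions, and the events in the sample tree are all hypothetical. It evaluates whether the root hazard occurs for a given set of basic events and lists the single causes, that is, basic events that trigger the hazard on their own.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Node:
    name: str
    gate: str = "leaf"  # "leaf", "and", or "or"
    children: List["Node"] = field(default_factory=list)

    def occurs(self, active: Set[str]) -> bool:
        """True if this event occurs given the set of active basic events."""
        if self.gate == "leaf":
            return self.name in active
        results = [child.occurs(active) for child in self.children]
        return all(results) if self.gate == "and" else any(results)


def basic_events(node: Node) -> List[str]:
    """Collect the names of all leaf events below a node."""
    if node.gate == "leaf":
        return [node.name]
    return [name for child in node.children for name in basic_events(child)]


def single_causes(root: Node) -> List[str]:
    """Basic events that lead to the root hazard without any other event."""
    return sorted(name for name in set(basic_events(root)) if root.occurs({name}))


# Hypothetical tree: the hazard sits at the root, causes are linked by gates.
root = Node("incorrect insulin dose", "or", [
    Node("sensor failure"),
    Node("unchecked computation error", "and", [
        Node("arithmetic error"),
        Node("dose check missing"),
    ]),
])

print(single_causes(root))  # ['sensor failure']
```

In this sample tree, the arithmetic error is not a single cause because the "and" gate requires the dose check to be missing as well.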

[Figure: Fault tree for the insulin pump]

Assessment of Risk Reduction

The objective of this process is to identify requirements that specify how the risks should be managed and that ensure accidents or incidents do not occur.

Strategies for risk reduction:

Risk prevention
Detection and removal of hazards
Limitation of damages

Use of Strategies

In critical systems, a mix of strategies is typically used.

In a control system for a chemical plant, the system will include sensors to detect and correct excessive pressure in the reactor.
However, it will also include an independent protection system that opens a relief valve if dangerously high pressure is detected.

Risks of Software – Insulin Pump

Arithmetic Errors: A calculation may cause the value of a variable to overflow or underflow. The system may include an exception handler for each type of arithmetic error.
Algorithm Errors: Compare the dose to be delivered with the previous dose or with the safe maximum dose. Reduce the dose if it is too high.
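
The dose comparison described above could look roughly like the following sketch. The limits, units, and function name are hypothetical values chosen for illustration and are not clinical figures.

```python
MAX_SINGLE_DOSE = 4  # hypothetical safe maximum for one delivery, in insulin units
MAX_INCREASE = 2     # hypothetical limit on growth over the previous dose


def checked_dose(computed: int, previous: int) -> int:
    """Return a dose that respects the safety limits.

    A negative computed value is treated as an arithmetic error and
    results in no insulin being delivered.
    """
    if computed < 0:
        return 0
    limit = min(MAX_SINGLE_DOSE, previous + MAX_INCREASE)
    return min(computed, limit)


print(checked_dose(10, 1))  # 3: capped by previous dose + MAX_INCREASE
print(checked_dose(10, 3))  # 4: capped by MAX_SINGLE_DOSE
print(checked_dose(-5, 2))  # 0: arithmetic error, deliver nothing
```

The check only ever reduces the dose, which matches the rule that an excessive dose is the hazard being guarded against here.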

[Figure: Safety requirements for the insulin pump]

Safety Specification

The safety requirements of a system should be specified separately.

These requirements should be, as discussed above, based on an analysis of potential hazards and risks.
The safety requirements generally apply to the system as a whole rather than to individual subsystems. In systems engineering terms, the safety of a system is an emergent property.

IEC 61508

It is an international standard for safety management that was designed specifically for protection systems; it does not apply to all safety-critical systems.

It incorporates a safety life cycle model and covers all aspects of safety management, from scoping to system retirement.

[Figure: Safety requirements of a control system]

[Figure: The safety life cycle]

Safety Requirements

Functional Safety Requirements: These requirements define the safety functions of the system, i.e., how the system should provide protection.
Safety Integrity Requirements: These requirements define the reliability and availability of the protection system. They are based on expected usage and are classified using a safety integrity level from 1 to 4.

Specification of Protection

It has some similarities with the safety specification.

It is not practical to specify protection requirements quantitatively.
The requirements are often “shall not” requirements, which define what must not occur.

Differences:

There is no well-defined notion of a protection life cycle for protection management.
There are no standards.
Threats are generic rather than system-specific hazards.
Protection technology (encryption, etc.) is mature. However, there are problems in transferring it to general use.
The dominance of a single vendor (e.g., Microsoft) means that a large number of systems can be affected by a single protection failure.

[Figure: The protection specification process]

Stages in the Specification of Protection

Identification and Evaluation of Assets: The assets (data and programs) and their necessary levels of protection are identified. The protection needed depends on the value of the asset, so that a file with passwords (say) has more value than a set of public Web pages.
Threat Analysis and Risk Assessment: Possible security threats are identified, and risks associated with each of these threats are estimated.
Threat Assignment: Identified threats are related to assets so that, for each identified asset, there is a list of associated threats.
Technology Analysis: The available protection technologies and their applicability to the identified threats are assessed.
Protection Requirements Specification: The protection requirements are specified. Where appropriate, the protection technologies that can be used to guard against the different threats to the system are explicitly identified.

Types of Protection Requirements

Identification requirements
Authentication requirements
Authorization requirements
Immunity requirements
Integrity requirements
Requirements for intrusion detection
Non-repudiation requirements
Privacy requirements
Protection auditing requirements
System maintenance protection requirements

[Figure: Protection requirements of LIBSYS]

Specification of Software Reliability

Hardware Reliability: What is the probability of a hardware component failing and how long it would take to repair that component?
Software Reliability: What is the probability that a software component will produce an incorrect output? Software failures are different from hardware failures because software does not wear out. It can continue operating even after producing an incorrect result.
Operator Reliability: What is the probability that the operator of a system will make a mistake?

Functional Requirements for Reliability

A predefined range will be defined for all values that are entered by the operator, and the system will verify that all operator entries are within this range.
The system, when started, will check all disks for bad blocks.
The system must use N-version programming to implement the braking control system.
The system should be implemented in a safe subset of the Ada language and verified using static analysis.
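
To illustrate the N-version programming requirement above, the sketch below runs several independently written versions of the same function and accepts the output only when a strict majority agrees. The `vote` function and the three toy brake controllers are hypothetical; a real system would use versions developed by separate teams.

```python
from collections import Counter
from typing import Callable, Sequence


def vote(versions: Sequence[Callable[[float], int]], reading: float) -> int:
    """Run every version and return the majority output.

    Raises RuntimeError when no output is produced by more than half
    of the versions.
    """
    outputs = [version(reading) for version in versions]
    value, count = Counter(outputs).most_common(1)[0]
    if count <= len(outputs) // 2:
        raise RuntimeError(f"no majority among outputs {outputs}")
    return value


# Three hypothetical brake controllers: 1 means "apply brakes".
def version_a(speed: float) -> int:
    return 1 if speed > 50 else 0


def version_b(speed: float) -> int:
    return int(speed > 50)


def version_c(speed: float) -> int:
    return 0  # deliberately faulty version


command = vote([version_a, version_b, version_c], 80.0)
print(command)  # 1
```

The faulty third version is outvoted, so a single defective implementation does not change the command.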

Non-functional Reliability Specification

The required level of system reliability must be expressed quantitatively.
Reliability is a dynamic system attribute; reliability specifications related to the source code are not meaningful.
- No more than N defects/1000 lines
- This is useful only for a post-delivery process analysis, where you are trying to gauge how good your development techniques are.
An appropriate reliability metric should be chosen to specify the overall reliability of the system.

Reliability Metrics

Reliability metrics are units of measurement of system reliability.

The system reliability is measured by counting the number of operational failures and, where appropriate, relating it to the demands made to the system and the time that the system was operational.
A long-term measurement program is needed to assess the reliability of critical systems.

Probability of Failure on Demand (POFOD)

It is the probability that the system fails when one service request is submitted.
It is useful when demands for service are intermittent and relatively infrequent.
It is suitable for protection systems where services are occasionally requested and where there are serious consequences if the service is not provided.
It is important for many safety-critical systems that include exception-management components (e.g., the emergency shutdown system at a chemical plant).
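
As a minimal sketch, POFOD can be estimated from operational data as the fraction of service demands that ended in failure. The function name and the sample figures are assumptions for illustration.

```python
def pofod(failed_demands: int, total_demands: int) -> float:
    """Estimate the probability of failure on demand from observed demands."""
    if total_demands <= 0:
        raise ValueError("at least one demand is needed for an estimate")
    return failed_demands / total_demands


# Hypothetical log: 2 failed shutdown requests out of 1,000 requests.
print(pofod(2, 1000))  # 0.002
```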

Rate of Occurrence of Failures (ROCOF)

Reflects the rate of failures in the system.
A ROCOF of 0.002 means that two failures are likely in 1000 time units of operation, e.g., 2 failures per 1000 hours of operation.
It is important for operating systems and transaction processing systems, where the system has to process a large number of similar requests which are relatively frequent (e.g., processing system for credit cards and reservation system for flights).

Mean Time to Failure (MTTF)

It is the measure of time between observed failures of the system. It is the reciprocal of ROCOF for stable systems.
MTTF of 500 means that the average time between failures is 500 time units.
Relevant for systems with long transactions, i.e., where system processing takes a long time. MTTF must be greater than the transaction time (e.g., systems for computer-aided design, where a designer will work on a project for several hours, and word processing systems).
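
The reciprocal relationship between ROCOF and MTTF can be checked with the figures used in the text. This is a small sketch; the function names are chosen for illustration.

```python
def rocof(failures: int, operating_time: float) -> float:
    """Failures per unit of operating time."""
    return failures / operating_time


def mttf(failures: int, operating_time: float) -> float:
    """Average operating time between observed failures."""
    return operating_time / failures


# 2 failures observed in 1,000 hours of operation.
rate = rocof(2, 1000)     # 0.002 failures per hour
mean_time = mttf(2, 1000) # 500.0 hours

print(rate, mean_time, 1 / rate)
```

For a long-transaction system, the same value can be compared with the transaction time: with a mean time to failure of 500 hours, a design session of several hours is unlikely to be interrupted.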

Availability (AVAL)

It is the measure of the fraction of time the system is available for use.
Takes into account the time to repair and restart.
Availability of 0.998 means that the software is available for 998 of the 1,000 units of time.
It is relevant to non-stop, continuously running systems (e.g., telephone switching systems and rail signaling systems).
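
To make the availability figure more concrete, the sketch below computes AVAL from uptime and downtime and converts an availability value into expected downtime per year. The 8,760-hour year and the function names are assumptions.

```python
def aval(uptime: float, downtime: float) -> float:
    """Fraction of total time during which the system is available."""
    return uptime / (uptime + downtime)


def yearly_downtime_hours(availability: float, hours_per_year: float = 8760) -> float:
    """Expected unavailable hours per year for a given availability."""
    return (1 - availability) * hours_per_year


a = aval(998, 2)  # available for 998 of 1,000 time units
print(f"{a:.3f}")                         # 0.998
print(f"{yearly_downtime_hours(a):.2f}")  # 17.52
```

An availability of 0.998 therefore still allows about 17.5 hours of downtime per year, which may be too much for a non-stop system.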

Specifying Non-functional Requirements

Reliability measurements do not take into account the consequences of failure.
Transient defects may not have real consequences, but other defects can cause loss or corruption of data and loss of system service.
It may be necessary to identify the different classes of faults and use different metrics for each. The reliability specification should be structured.

Consequences of Failure

When specifying reliability, it is not only the number of failures that matters, but also the consequences of those failures.

Failures that have serious consequences are clearly more damaging than those where repair and recovery are straightforward.
In certain cases, therefore, different reliability specifications for different types of faults can be defined.

[Figure: Classification of failures]

Steps to a Reliability Specification

For each subsystem, consider the consequences of possible system failures.
From the analysis of system failures, divide the failures into appropriate classes.
For each class of failure identified, define the reliability by using an appropriate metric. Different metrics can be used for different reliability requirements.
Identify functional reliability requirements to reduce the chances of critical failures.

Bank Self-Service System (ATM)

Each machine on the network is used 300 times per day.
The bank has 1,000 machines.
The lifetime of the software release is 2 years.
Each machine handles approximately 200,000 transactions over that lifetime (300 per day for 2 years).
About 300,000 database transactions per day in total.

[Figure: Reliability specification of an electronic cash system]

Validating the Specification

It is impossible to empirically validate very high reliability specifications.
No corruption of the database means a POFOD of less than 1 in 200 million.
If a transaction takes 1 second, then simulating one day’s transactions takes about 3.5 days.
It would take longer than the lifetime of the system to test it for reliability.
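
The arithmetic behind these statements can be reproduced from the ATM figures given earlier (1,000 machines, 300 uses per machine per day, a 2-year release lifetime, 1 second per simulated transaction). The sketch below assumes 365-day years and sequential simulation.

```python
MACHINES = 1_000
USES_PER_MACHINE_PER_DAY = 300
LIFETIME_DAYS = 2 * 365          # assumption: 365-day years
SECONDS_PER_TRANSACTION = 1
SECONDS_PER_DAY = 24 * 60 * 60

transactions_per_day = MACHINES * USES_PER_MACHINE_PER_DAY
lifetime_transactions = transactions_per_day * LIFETIME_DAYS

# Time needed to simulate the transactions sequentially.
days_to_simulate_one_day = transactions_per_day * SECONDS_PER_TRANSACTION / SECONDS_PER_DAY
years_to_simulate_lifetime = (
    lifetime_transactions * SECONDS_PER_TRANSACTION / SECONDS_PER_DAY / 365
)

print(transactions_per_day)                  # 300000
print(lifetime_transactions)                 # 219000000
print(round(days_to_simulate_one_day, 1))    # 3.5
print(round(years_to_simulate_lifetime, 1))  # 6.9
```

Simulating the roughly 200 million lifetime transactions once would take nearly seven years, which is longer than the two-year lifetime of the release.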

Key Points

Risk analysis is the basis for identifying requirements for system reliability.
Risk analysis is related to the assessment of the chances of a risk occurring and the classification of risks according to their severity.
The protection requirements should identify the assets and define how they should be protected.
The reliability requirements can be defined quantitatively.
The reliability metrics include POFOD, ROCOF, MTTF, and AVAL.
Non-functional reliability specifications can lead to functional system requirements that reduce failures or cope with their occurrence.
The software architecture is the fundamental framework for structuring the system.
Architectural design decisions include decisions on the type of application, the distribution of the system, and the architectural styles to be used.
Different architectural models, such as a structural model, a control model, and a decomposition model, can be developed.
Organizational models of a system include repository models, client-server models, and abstract machine models.
Decomposition models include object models and pipelining models.
Control models include centralized control models and event-driven models.
Reference architectures can be used to communicate domain-specific architectures and to assess and compare architectural designs.
Different models of architecture can be produced during the design process.
Each model presents different perspectives on architecture.

Architecture and System Attributes

Performance: Localize critical operations and minimize communications.
Protection: Use a layered architecture with critical assets in the inner layers.
Safety: Isolate safety-critical components.
Availability: Include redundant components in the architecture.
Maintainability: Use fine-grained, replaceable components.

Prototyping

Prototyping is an approach based on an evolutionary view of software development, affecting the process as a whole. This approach involves producing early versions (prototypes, similar to architectural models) of a future system, with which checks and trials can be conducted to evaluate some of its qualities before the system is actually built.

Rapid development of software to validate the requirements.
In the past, the prototype had the sole purpose of assessing the requirements, so the final system still had to be built with traditional development.
Currently, the boundaries between normal development and prototyping systems are often vague, and many systems are developed using an evolutionary approach.

Uses of Prototype Systems

The main use is to help customers and developers understand the requirements for the system.
Elicitation of Requirements: Users can experiment with the prototype to see how the system can support their work.
Validation of Requirements: The prototype can reveal errors and omissions in requirements. Prototyping can be considered as a risk reduction activity which reduces requirements risk.

Prototyping Benefits

Misunderstandings between software users and developers are exposed.
Forgotten services can be detected, and confusing services may be identified.
A working system is available in the early stages of the development process.
The prototype may serve as a basis for deriving the specification of a production-quality system.
The prototype may be used for user training and system testing.

Prototype Development Process

Prototyping Benefits

Improved usability of the system
Closer approach of the system with users’ needs
Improving the quality of the project
Improved ease of maintenance
Reduced development effort

Evolutionary Prototyping: An approach to system development where an initial prototype is produced and refined through several stages until the final system.
Throwaway Prototyping: A prototype, which is usually a practical implementation of the system, is produced to help discover problems with the requirements and is then discarded. The system is then developed using some other development process.

Goals of Prototyping

The objective of evolutionary prototyping is to provide end-users with a functioning system. The development starts with those requirements that are better understood.
The goal of throwaway prototyping is to validate or derive the system requirements. The prototyping process starts with those requirements that are not well understood.

Prototyping Approaches

Evolutionary Prototyping

Must be used for systems where the specification cannot be developed a priori, for example, AI systems and user interface systems.
Based on techniques that allow rapid iterations in the development of applications.
Verification is impossible since there is no existing specification. Validation means demonstrating the adequacy of the system.

Evolutionary Prototyping Advantages

Rapid System Delivery: In some cases, rapid delivery and ease of use are more important than the details of functionality or long-term software maintainability.
User Engagement with the System: The users’ involvement with the system means that it is more likely to meet their requirements and that they are more committed to making the system work.

Problems with Evolutionary Prototyping

Management Issues: Existing management processes assume the waterfall model of development. Specialist skills are required and may not be available in the development team.
Maintenance Problems: Continual change tends to corrupt the structure of the prototype system, so long-term maintenance can be expensive.
Contractual Issues: Contracts are generally established based on a complete specification of the software.

Incremental Development

The system is developed and released in increments after establishing an overall architecture.
Requirements and specifications for each increment may be developed.
Users can evaluate the increments released while others are being developed. Therefore, this serves as a form of prototype system.
Incremental development combines the advantages of evolutionary prototyping with a development process that is more easily managed and gives the system a better structure.

Throwaway Prototyping

Used to reduce the risks with the requirements.
The prototype is developed from an initial specification, delivered for review, and then discarded.
Throwaway prototypes should not be considered as a final system.
Important features may have been excluded from the prototype.
There is no specification for future maintenance.
The system will be poorly structured and difficult to maintain.

Delivering Throwaway Prototypes

Developers may be pressured to deliver a throwaway prototype as a final product. This is not recommended.
It may be impossible to adjust the prototype to meet non-functional requirements.
The prototype is inevitably undocumented, and that’s bad for long-term maintenance.
The changes made during the development of the prototype will probably have degraded the system’s structure.
Organizational quality standards are usually left out in the development of the prototype.

Rapid Prototyping Techniques

Several techniques can be used for prototype development.
- Development with dynamic high-level languages
- Database programming
- Assembly of components and applications
Such techniques are not mutually exclusive and are often used together.
Visual programming is an inherent part of most prototype development systems.

Prototyping with Reuse

Development at the Application Level: Entire application systems are integrated with the prototype so that their functionality can be shared. For example, if text editing capability is required, a standard system text editor can be integrated.
Development at the Component Level: Individual components are integrated within a standard framework to implement the system. The framework can be a scripting language (Visual Basic or Perl) or an integration framework (CORBA or JavaBeans).

Visual Programming with Reuse

Scripting languages such as Visual Basic support visual programming where the prototype is developed by creating a user interface from standard items (screens, fields, buttons, and menus) and associating components with these items.
A large library of components exists to support this development.

Key Points

A prototype system can be used to give end-users an impression of the actual capabilities of the system.
Prototyping is becoming more common for system development where rapid development is essential.
Throwaway prototypes are used to understand the system requirements.
In evolutionary prototyping, the system is developed by evolving an initial version into a final version of the system.
Rapid development is important in prototyping systems. This may lead to the exclusion of system functionality or reduction of non-functional requirements.
Prototyping techniques include the use of very high-level languages, database programming, and building prototypes from reusable components.
Prototyping is essential for the development of user interfaces, which are difficult to specify using a static model. Users should be involved in the evaluation and development of the prototype.

Agile

What is an Agile Methodology?

Ad-hoc software development usually produces very bad results, especially in large systems!

Traditional engineering places great emphasis on design before construction.

Traditional View of Software Engineering

Now we want to change how we develop software.