10.5 System operation
(Software Engineering, 9th ed., I. Sommerville, Pearson, 2011)

height or speed, when an emergency occurs, and so on. For new systems, these operational processes have to be defined and documented during the system development process. Operators may have to be trained and other work processes adapted to make effective use of the new system. Undetected problems may arise at this stage because the system specification may contain errors or omissions. Although the system may perform to specification, its functions may not meet the real operational needs. Consequently, the operators may not use the system as its designers intended.
The key benefit of having system operators is that people have a unique capability of being able to respond effectively to unexpected situations, even when they have never had direct experience of these situations. Therefore, when things go wrong, the operators can often recover the situation, although this may sometimes mean that the defined process is violated. Operators also use their local knowledge to adapt and improve processes. Normally, the actual operational processes are different from those anticipated by the system designers.
Consequently, you should design operational processes to be flexible and adaptable. The operational processes should not be too constraining: they should not require operations to be done in a particular order, and the system software should not rely on a specific process being followed. Operators usually improve the process because they know what does and does not work in a real situation.
A problem that may only emerge after the system goes into operation is the operation of the new system alongside existing systems. There may be physical problems of incompatibility, or it may be difficult to transfer data from one system to another. More subtle problems might arise because different systems have different user interfaces. Introducing a new system may therefore increase the operator error rate, as operators apply the user interface commands of the wrong system.
10.5.1 Human error
I suggested earlier in the chapter that non-determinism was an important issue in sociotechnical systems and that one reason for this is that the people in the system do not always behave in the same way. Sometimes they make mistakes in using the system, and this has the potential to cause system failure. For example, an operator may forget to record that some action has been taken, so that another operator erroneously repeats that action. If the action is to debit or credit a bank account, say, then a system failure occurs as the amount in the account is then incorrect.
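The double-debit scenario can be made concrete with a small sketch. This is my illustration, not an example from the book; the names `ActionLog` and `apply_debit` are hypothetical. It shows one possible technical barrier: every operator action carries a unique ID, and the system refuses to apply the same action twice.

```python
# Sketch of a defensive action log (illustrative; names are invented).
# A repeated operator action is trapped instead of corrupting the account.

class DuplicateActionError(Exception):
    pass

class ActionLog:
    """Records each operator action under a unique ID so that a
    repeated action can be detected and refused."""
    def __init__(self):
        self._seen = set()

    def record(self, action_id):
        if action_id in self._seen:
            raise DuplicateActionError(action_id)
        self._seen.add(action_id)

def apply_debit(balance, amount, action_id, log):
    log.record(action_id)   # barrier: trap the duplicate before any change
    return balance - amount

log = ActionLog()
balance = 100
balance = apply_debit(balance, 30, "txn-001", log)      # first debit succeeds
try:
    balance = apply_debit(balance, 30, "txn-001", log)  # repeat is trapped
except DuplicateActionError:
    pass
# balance is 70, not 40: the duplicate action did not become a system failure
```

The point is not the particular mechanism but that the check lives in the system, so the failure does not depend on an operator remembering to record the action.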
As Reason (2000) discusses, human errors will always occur, and there are two ways to view the problem of human error:
1. The person approach. Errors are considered to be the responsibility of the individual, and 'unsafe acts', such as an operator failing to engage a safety barrier, are a consequence of individual carelessness or reckless behavior. People who adopt this approach believe that human errors can be reduced by threats of disciplinary action, more stringent procedures, retraining, etc. Their view is that the error is the fault of the individual responsible for making the mistake.
2. The systems approach. The basic assumption is that people are fallible and will make mistakes. The errors that people make are often a consequence of system design decisions that lead to erroneous ways of working, or of organizational factors, which affect the system operators. Good systems should recognize the possibility of human error and include barriers and safeguards that detect human errors and allow the system to recover before failure occurs. When a failure does occur, the issue is not to find an individual to blame but to understand how and why the system defenses did not trap the error.
I believe that the systems approach is the right one and that systems engineers should assume that human errors will occur during system operation. Therefore, to improve the security and dependability of a system, designers have to think about the defenses and barriers to human error that should be included in a system. They should also think about whether these barriers should be built into the technical components of the system. If not, they could be part of the processes and procedures for using the system, or could be operator guidelines that are reliant on human checking and judgment.
Examples of defenses that may be included in a system are:

1. An air traffic control system may include an automated conflict alert system. When a controller instructs an aircraft to change its speed or altitude, the system extrapolates its trajectory to see if it could intersect with any other aircraft. If so, it sounds an alarm.
2. The same system may have a clearly defined procedure to record the control
instructions that have been issued. These procedures help the controller check if they have issued the instruction correctly and make the information available to
others for checking.
3. Air traffic control usually involves a team of controllers who constantly monitor
each other’s work. Therefore, when a mistake is made, it is likely that it will be detected and corrected before an incident occurs.
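The conflict alert defense in point 1 can be sketched in a few lines. This is a deliberately simplified toy, not the algorithm a real ATC system uses: straight-line extrapolation in 2D, an invented function name, and an arbitrary separation threshold of 5 units stand in for real trajectory prediction and separation standards.

```python
# Toy conflict alert (illustrative only): extrapolate two aircraft along
# straight-line tracks and alert if they close to within a minimum
# separation at any checked time step.

def conflict_alert(pos_a, vel_a, pos_b, vel_b,
                   horizon=600, step=10, min_sep=5.0):
    """Return True if aircraft a and b come within min_sep of each other
    at any time up to `horizon` seconds, checked every `step` seconds.
    Positions and per-second velocities are (x, y) tuples."""
    for t in range(0, horizon + 1, step):
        ax, ay = pos_a[0] + vel_a[0] * t, pos_a[1] + vel_a[1] * t
        bx, by = pos_b[0] + vel_b[0] * t, pos_b[1] + vel_b[1] * t
        if ((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5 < min_sep:
            return True   # sound the alarm
    return False

# Two aircraft on head-on converging tracks:
print(conflict_alert((0, 0), (0.1, 0), (60, 0), (-0.1, 0)))  # True
# Parallel tracks with wide lateral separation:
print(conflict_alert((0, 0), (0.1, 0), (0, 50), (0.1, 0)))   # False
```

Even this toy shows the latent-condition trade-off discussed below: lengthening the horizon or widening `min_sep` catches more genuine conflicts but also raises more false alarms.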
Inevitably, all barriers have weaknesses of some kind. Reason calls these 'latent conditions', as they usually only contribute to system failure when some other problem occurs. For example, in the above defenses, a weakness of a conflict alert system is that it may lead to many false alarms. Controllers may therefore ignore warnings from the system. A weakness of a procedural system may be that unusual but essential information can't be easily recorded. Human checking may fail when all of the people involved are under stress and make the same mistake.
Latent conditions lead to system failure when the defenses built into the system do not trap an active failure by a system operator. The human error is a trigger for the failure but should not be considered to be its sole cause. Reason explains this using his well-known 'Swiss cheese' model of system failure (Figure 10.9).
In this model, the defenses built into a system are compared to slices of Swiss cheese. Some types of Swiss cheese, such as Emmental, have holes, so the analogy is that the latent conditions are comparable to the holes in the cheese slices. The position of these holes is not static but changes depending on the state of the overall sociotechnical system. If each slice represents a barrier, failures can occur when the holes line up at the same time as a human operational error. An active failure of system operation gets through the holes and leads to an overall system failure.
Normally, of course, the holes should not be aligned, so operational failures are trapped by the system. To reduce the probability that system failure will result from human error, designers should:

1. Design the system so that different types of barriers are included. This means that the 'holes' will probably be in different places and so there is less chance of the holes lining up and failing to trap an error.
2. Minimize the number of latent conditions in a system. Effectively, this means reducing the number and size of system 'holes'.

Of course, the design of the system as a whole should also attempt to avoid the active failures that can trigger a system failure. This may involve designing the operational processes and the system to ensure that operators are not overworked, distracted, or presented with excessive amounts of information.
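The benefit of multiple independent barriers can be illustrated with a toy Monte Carlo version of the Swiss cheese model. This is my sketch, not from the book, and the hole probabilities are invented: each barrier independently fails to trap an error with some probability (its 'holes'), and a system failure requires every hole to line up.

```python
# Toy Monte Carlo sketch of the Swiss cheese model (illustrative only).
import random

def system_fails(hole_probs, rng):
    """An active error causes a system failure only if it passes
    through every barrier (all holes line up)."""
    return all(rng.random() < p for p in hole_probs)

def failure_rate(hole_probs, trials=100_000, seed=1):
    """Estimate the fraction of active errors that become failures."""
    rng = random.Random(seed)
    return sum(system_fails(hole_probs, rng) for _ in range(trials)) / trials

one_barrier = failure_rate([0.1])              # single barrier, 10% holes
three_barriers = failure_rate([0.1, 0.1, 0.1]) # three independent barriers
# Each extra independent barrier multiplies down the chance that the
# holes line up, so the three-barrier system fails far less often:
assert three_barriers < one_barrier
```

The assumption of independence is doing the work here, which is exactly why the text recommends barriers of different types: correlated barriers (e.g. three human checks by people under the same stress) behave more like a single slice.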
10.5.2 System evolution
Large, complex systems have a very long lifetime. During their life, they are changed to correct errors in the original system requirements and to implement new requirements that have emerged. The system's computers are likely to be replaced with new, faster machines. The organization that uses the system may reorganize itself and hence use the system in a different way. The external environment of the system may change, forcing changes to the system. Hence evolution, where the system changes to accommodate environmental change, is a process that runs alongside normal system operational processes. System evolution involves reentering the development process to make changes and extensions to the system's hardware, software, and operational processes.
Figure 10.9 Reason's Swiss cheese model of system failure. (Diagram labels: Barriers; Active Failure (Human Error); System Failure.)
Legacy systems

Legacy systems are sociotechnical computer-based systems that have been developed in the past, often using older or obsolete technology. These systems include not only hardware and software but also legacy processes and procedures: old ways of doing things that are difficult to change because they rely on legacy software. Changes to one part of the system inevitably involve changes to other components. Legacy systems are often business-critical systems. They are maintained because it is too risky to replace them.

http://www.SoftwareEngineering-9.com/LegacySys
System evolution, like software evolution discussed in Chapter 9, is inherently costly for several reasons:
1. Proposed changes have to be analyzed very carefully from a business and a technical perspective. Changes have to contribute to the goals of the system and should not simply be technically motivated.
2. Because subsystems are never completely independent, changes to one subsystem may adversely affect the performance or behavior of other subsystems. Consequent changes to these subsystems may therefore be needed.
3. The reasons for original design decisions are often unrecorded. Those responsible for the system evolution have to work out why particular design decisions were made.
4. As systems age, their structure typically becomes corrupted by change, so the costs of making further changes increase.

Systems that have evolved over time are often reliant on obsolete hardware and software technology. If they have a critical role in an organization, they are known as 'legacy systems'. These are usually systems that the organization would like to replace but doesn't, as the risks or costs of replacement cannot be justified.
From a dependability and security perspective, changes to a system are often a source of problems and vulnerabilities. If the people implementing the change are different from those who developed the system, they may be unaware that a design decision was made for dependability and security reasons. Therefore, they may change the system and lose some safeguards that were deliberately implemented when the system was built. Furthermore, as testing is so expensive, complete retesting may be impossible after every system change. Adverse side effects of changes that introduce or expose faults in other system components may not then be discovered.
KEY POINTS

■ Sociotechnical systems include computer hardware, software, and people, and are situated within an organization. They are designed to support organizational or business goals and objectives.

■ Human and organizational factors such as organizational structure and politics have a significant effect on the operation of sociotechnical systems.

■ The emergent properties of a system are characteristics of the system as a whole rather than of its component parts. They include properties such as performance, reliability, usability, safety, and security. The success or failure of a system is often dependent on these emergent properties.

■ The fundamental systems engineering processes are system procurement, system development, and system operation.

■ System procurement covers all of the activities involved in deciding what system to buy and who should supply that system. High-level requirements are developed as part of the procurement process.

■ System development includes requirements specification, design, construction, integration, and testing. System integration, where subsystems from more than one supplier must be made to work together, is particularly critical.

■ When a system is put into use, the operational processes and the system itself have to change to reflect changing business requirements.

■ Human errors are inevitable, and systems should include barriers to detect these errors before they lead to system failure. Reason's Swiss cheese model explains how human error plus latent defects in the barriers can lead to system failure.
FURTHER READING
'Airport 95: Automated baggage system'. An excellent, readable case study of what can go wrong with a systems engineering project and how software tends to get the blame for wider systems failures. ACM Software Engineering Notes, 21, March 1996. http://doi.acm.org/10.1145/227531.227544.
'Software system engineering: A tutorial'. A good general overview of systems engineering, although Thayer focuses exclusively on computer-based systems and does not discuss sociotechnical issues. R. H. Thayer. IEEE Computer, April 2002. http://dx.doi.org/10.1109/MC.2002.993773.
Trust in Technology: A Socio-technical Perspective. This book is a set of papers that are all concerned, in some way, with the dependability of sociotechnical systems. K. Clarke, G. Hardstone, M. Rouncefield and I. Sommerville (eds.), Springer, 2006.
'Fundamentals of Systems Engineering'. This is the introductory chapter in NASA's systems engineering handbook. It presents an overview of the systems engineering process for space systems. Although these are mostly technical systems, there are sociotechnical issues to be considered. Dependability is obviously critically important. In NASA Systems Engineering Handbook, NASA-SP2007-6105, 2007. http://education.ksc.nasa.gov/esmdspacegrant/Documents/NASA%20SP-2007-6105%20Rev%201%20Final%2031Dec2007.pdf.
EXERCISES