Fundamental Concepts of Dependability
Algirdas Avižienis
UCLA Computer Science Dept.
Univ. of California, Los Angeles
USA
Jean-Claude Laprie
LAAS-CNRS
Toulouse
France
Brian Randell
Dept. of Computing Science
Univ. of Newcastle upon Tyne
U.K.
UCLA CSD Report no. 010028
LAAS Report no. 01-145
Newcastle University Report no. CS-TR-739

LIMITED DISTRIBUTION NOTICE
This report has been submitted for publication. It has been issued as a research report for early peer distribution.
Abstract
Dependability is the system property that integrates such attributes as reliability, availability, safety, security, survivability, and maintainability. The aim of this presentation is to summarize the fundamental concepts of dependability. After a historical perspective, definitions of dependability are given. A structured view of dependability follows, according to a) the threats, i.e., faults, errors and failures, b) the attributes, and c) the means for dependability, which are fault prevention, fault tolerance, fault removal and fault forecasting.
The protection and survival of complex information systems that are embedded in the infrastructure supporting advanced society has become a national and world-wide concern of the highest priority [1]. Increasingly, individuals and organizations are developing or procuring sophisticated computing systems on whose services they need to place great reliance — whether to service a set of cash dispensers, control a satellite constellation, an airplane, a nuclear plant, or a radiation therapy device, or to maintain the confidentiality of a sensitive data base. In differing circumstances, the focus will be on differing properties of such services — e.g., on the average real-time response achieved, the likelihood of producing the required results, the ability to avoid failures that could be catastrophic to the system's environment, or the degree to which deliberate intrusions can be prevented. The notion of dependability provides a very convenient means of subsuming these various concerns within a single conceptual framework.
Our goal is to present a concise overview of the concepts, techniques and tools that have evolved over the past forty years in the field of dependable computing and fault tolerance.
ORIGINS AND INTEGRATION OF THE CONCEPTS
The delivery of correct computing and communication services has been a concern of their providers and users since the earliest days. In the July 1834 issue of the Edinburgh Review, Dr. Dionysius Lardner published the article “Babbage's calculating engine”, in which he wrote:
“The most certain and effectual check upon errors which arise in the process of computation, is to cause the same computations to be made by separate and independent computers; and this check is rendered still more decisive if they make their computations by different methods”.
The first generation of electronic computers (late 1940's to mid-50's) used rather unreliable components, therefore practical techniques were employed to improve their reliability, such as error control codes, duplexing with comparison, triplication with voting, diagnostics to locate failed components, etc. At the same time J. von Neumann, E. F. Moore and C. E. Shannon, and their successors developed theories of using redundancy to build reliable logic structures from less reliable components, whose faults were masked by the presence of multiple redundant components. The theories of masking redundancy were unified by W. H. Pierce as the concept of failure tolerance in 1965 (Academic Press). In 1967, A. Avizienis integrated masking with the practical techniques of error detection, fault diagnosis, and recovery into the concept of fault-tolerant systems [2]. In the reliability modeling field, the major event was the introduction of the coverage concept by Bouricius, Carter and Schneider [3]. Seminal work on software fault tolerance was initiated by B. Randell [4]; later it was complemented by N-version programming [5].
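To make the idea of masking redundancy concrete, the following is a minimal sketch (our illustration, not taken from the cited works) of triplication with majority voting: the erroneous output of one faulty replica is outvoted, and thereby masked, by the two correct replicas.

# A minimal sketch of masking redundancy: triplication with majority voting.
from collections import Counter

def majority_vote(outputs):
    """Return the value produced by a majority of the redundant replicas."""
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise RuntimeError("no majority: too many replicas have failed")
    return value

# Three independently computed results; one replica has failed.
replica_outputs = [42, 42, 17]
print(majority_vote(replica_outputs))   # -> 42, the faulty replica is masked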
The formation of the IEEE-CS TC on Fault-Tolerant Computing in 1970 and of IFIP WG 10.4 Dependable Computing and Fault Tolerance in 1980 accelerated the emergence of a consistent set of concepts and terminology. Seven position papers were presented in 1982 at FTCS-12 in a special session on fundamental concepts of fault tolerance, and J.-C. Laprie formulated a synthesis in 1985 [6]. Further work by members of IFIP WG 10.4, led by J.-C. Laprie, resulted in the 1992 book Dependability: Basic Concepts and Terminology (Springer-Verlag), in which the English text was also translated into French, German, Italian, and Japanese.
In this book, intentional faults (malicious logic, intrusions) were listed along with accidental faults (physical, design, or interaction faults). Exploratory research on the integration of fault tolerance and the defenses against deliberately malicious faults, i.e., security threats, was started in the mid-80's [7-9]. The first IFIP Working Conference on Dependable Computing for Critical Applications was held in 1989. This and the six Working Conferences that followed fostered the interaction of the dependability and security communities, and advanced the integration of security (confidentiality, integrity and availability) into the framework of dependable computing [10].
THE DEFINITIONS OF DEPENDABILITY
A systematic exposition of the concepts of dependability consists of three parts: the threats to, the attributes of, and the means by which dependability is attained, as shown in Figure 1.
Figure 1 - The dependability tree: the attributes (availability, reliability, safety, confidentiality, integrity, maintainability), the means (fault prevention, fault tolerance, fault removal, fault forecasting), and the threats (faults, errors, failures).
Computing systems are characterized by five fundamental properties: functionality, usability, performance, cost, and dependability. Dependability of a computing system is the ability to deliver service that can justifiably be trusted. The service delivered by a system is its behavior as it is perceived by its user(s); a user is another system (physical, human) that interacts with the former at the service interface. The function of a system is what the system is intended to do, and is described by the functional specification. Correct service is delivered when the service implements the system function. A system failure is an event that occurs when the delivered service deviates from correct service. A failure is thus a transition from correct service to incorrect service, i.e., to not implementing the system function. The delivery of incorrect service is a system outage. A transition from incorrect service to correct service is service restoration. The definition of failure leads to an alternate definition of dependability, which complements the initial definition by providing a criterion for adjudicating whether the delivered service can be trusted or not: the ability of a system to avoid failures that are more frequent or more severe, and outage durations that are longer, than is acceptable to the user(s). In the opposite case, the system is no longer dependable: it suffers from a dependability failure, that is, a meta-failure.
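The following is a minimal sketch of this acceptance criterion (our illustration; the thresholds, the failure log, and the function is_dependable are hypothetical): the delivered service is judged trustworthy only while the observed failure frequency, the failure severities, and the outage durations all stay within what the user(s) declared acceptable.

# A minimal, hypothetical sketch of adjudicating dependability against
# user-defined acceptability thresholds.
from dataclasses import dataclass

@dataclass
class Failure:
    severity: int          # e.g., 1 = minor ... 3 = catastrophic
    outage_minutes: float  # duration of the resulting outage

def is_dependable(failures: list[Failure],
                  observation_hours: float,
                  max_failures_per_year: float,
                  max_severity: int,
                  max_outage_minutes: float) -> bool:
    """Judge the delivered service against the user's acceptability thresholds."""
    rate_per_year = len(failures) * (24 * 365) / observation_hours
    too_frequent = rate_per_year > max_failures_per_year
    too_severe = any(f.severity > max_severity for f in failures)
    too_long = any(f.outage_minutes > max_outage_minutes for f in failures)
    # Violating any threshold is a dependability failure (a meta-failure).
    return not (too_frequent or too_severe or too_long)

log = [Failure(severity=1, outage_minutes=2.0), Failure(severity=2, outage_minutes=30.0)]
print(is_dependable(log, observation_hours=8760, max_failures_per_year=4,
                    max_severity=2, max_outage_minutes=60))   # -> True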
THE THREATS: FAULTS, ERRORS, AND FAILURES
A system may fail either because it does not comply with the specification, or because the specification did not adequately describe its function. An error is that part of the system state that may cause a subsequent failure: a failure occurs when an error reaches the service interface and alters the service. A fault is the adjudged or hypothesized cause of an error. A fault is active when it produces an error; otherwise it is dormant.
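As a minimal sketch (our illustration, not from the paper), the fault-error-failure chain can be traced on a single hypothetical component: a dormant fault, once activated, corrupts part of the component state (an error), and a failure is observed only when that erroneous state reaches the service interface.

# A minimal sketch of the fault -> error -> failure chain for one component.
class Component:
    """A component whose internal state may be corrupted by a fault."""

    def __init__(self):
        self.state = 42            # correct internal state
        self.fault_dormant = True  # a fault is present but not yet activated

    def activate_fault(self):
        # The dormant fault becomes active: it corrupts part of the
        # component state, i.e., it produces an error.
        self.fault_dormant = False
        self.state = -1            # erroneous state

    def service(self):
        # The error reaches the service interface only when the erroneous
        # part of the state is used to deliver service.
        return self.state

EXPECTED = 42
c = Component()
assert c.service() == EXPECTED        # fault dormant: correct service
c.activate_fault()                    # fault activation creates an error
if c.service() != EXPECTED:           # the error reaches the service interface
    print("failure: delivered service deviates from correct service")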
A system does not always fail in the same way. The ways in which a system can fail are its failure modes. These can be ranked according to failure severities. The failure modes characterize incorrect service according to four viewpoints:
•the failure domain,
•the controllability of failures,
•the consistency of failures, when a system has two or more users,
•the consequences of failures on the environment.
Figure 2 shows the modes of failures according to the above viewpoints, as well as the failure symptoms which result from the combination of the domain, controllability and consistency viewpoints. The failure symptoms can be mapped into the failure severities resulting from grading the consequences of failures.
Figure 2 - The failure modes
A system consists of a set of interacting components, therefore the system state is the set of its component states. A fault originally causes an error within the state of one (or more) components, but system failure will not occur as long as the error does not reach the service interface of the system. A convenient classification of errors is to describe them in terms of the component failures that they cause, using the terminology of Figure 2: value vs. timing errors; consistent vs. inconsistent (‘Byzantine’) errors when the output goes to two or more components; errors of different severities: minor vs. ordinary vs. catastrophic errors. An error is detected if its presence is indicated by an error message or error signal. Errors that are present but not detected are latent errors.
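This vocabulary can be collected into a small record, shown below as a minimal sketch (our illustration only; the Error class and its fields are hypothetical): each error is described by the component failure it could cause, and it counts as detected only once an error signal is raised, remaining latent until then.

# A minimal sketch of the error classification and of detected vs. latent errors.
from dataclasses import dataclass
from enum import Enum

class ErrorDomain(Enum):
    VALUE = "value"
    TIMING = "timing"

class Consistency(Enum):
    CONSISTENT = "consistent"
    INCONSISTENT = "inconsistent (Byzantine)"

class Severity(Enum):
    MINOR = 1
    ORDINARY = 2
    CATASTROPHIC = 3

@dataclass
class Error:
    domain: ErrorDomain
    consistency: Consistency
    severity: Severity
    detected: bool = False      # False => the error is latent

    def raise_error_signal(self) -> None:
        """An error-detection mechanism indicates the presence of the error."""
        self.detected = True

e = Error(ErrorDomain.VALUE, Consistency.CONSISTENT, Severity.MINOR)
assert not e.detected          # latent error
e.raise_error_signal()
assert e.detected              # detected error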
Faults and their sources are very diverse. Their classification according to six major criteria is presented in Figure 3. It could be argued that introducing phenomenological causes in the classification criteria of faults may lead recursively to questions such as ‘why do programmers make mistakes?’, ‘why do integrated circuits fail?’ Fault is a concept that serves to stop recursion. Hence the definition given: the adjudged or hypothesized cause of an error. This cause may vary depending upon the viewpoint that is chosen: fault tolerance mechanisms, maintenance engineer, repair shop, developer, semiconductor physicist, etc.
Figure 3 - The elementary fault classes: phase of creation or occurrence (developmental faults, operational faults), system boundaries (internal faults, external faults), phenomenological cause (natural faults, human-made faults), domain (hardware faults, software faults), intent (accidental or non-malicious deliberate faults, deliberately malicious faults), and persistence (permanent faults, transient faults).
Combining the elementary fault classes of Figure 3 leads to the tree of the upper part of Figure 4. The leaves of the tree are gathered into three major fault classes for which defenses need to be devised: design faults, physical faults, interaction faults. The boxes of Figure 4 point at generic illustrative fault classes.
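The grouping can be sketched as a small classification function. The following is our reading of the combination, not a reproduction of Figure 4, and the rules are an approximation for illustration: natural faults are grouped as physical faults, human-made developmental faults as design faults, and human-made operational faults as interaction faults.

# A minimal, approximate sketch mapping elementary fault classes (Figure 3)
# onto the three major fault classes.
from enum import Enum

class Phase(Enum):
    DEVELOPMENTAL = "developmental"
    OPERATIONAL = "operational"

class Cause(Enum):
    NATURAL = "natural"
    HUMAN_MADE = "human-made"

def major_fault_class(phase: Phase, cause: Cause) -> str:
    """Derive a major fault class from two of the six elementary criteria."""
    if cause is Cause.NATURAL:
        return "physical fault"      # e.g., wear-out, radiation-induced upsets
    if phase is Phase.DEVELOPMENTAL:
        return "design fault"        # e.g., software flaws, hardware errata
    # Human-made operational faults are, in addition, external faults.
    return "interaction fault"       # e.g., operator mistakes, intrusions

print(major_fault_class(Phase.OPERATIONAL, Cause.HUMAN_MADE))   # -> interaction fault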
Non-malicious deliberate faults can arise during either development or operation. During development, they result generally from tradeoffs, either a) aimed at preserving acceptable performance and facilitating system utilization, or b) induced by economic considerations; such faults can be sources of security breaches, in the form of covert channels. Non-malicious deliberate interaction faults may result from the action of an operator either aimed at overcoming an unforeseen situation, or deliberately violating an operating procedure without having realized the possibly damaging consequences of his or her action. Non-malicious deliberate faults share the property that often it is recognized that they were faults only after an unacceptable system behavior, thus a failure, has ensued; the specifier(s), designer(s), implementer(s) or operator(s) did not realize that the consequence of some decision of theirs was a fault.
Malicious faults fall into two classes: a) malicious logics [11], which encompass developmental faults such as Trojan horses, logic or timing bombs, and trapdoors, as well as operational faults (with respect to the given system) such as viruses or worms, and b) intrusions. There are interesting and obvious similarities between an intrusion that exploits an internal fault and a physical external fault that ‘exploits’ a lack of shielding. It is in addition noteworthy that a) the external character of intrusions does not exclude the possibility that they may be attempted by system operators or administrators who are exceeding their rights, and that b) intrusions may use physical means to cause faults: power fluctuation, radiation, wire-tapping, etc.
Figure 4 - The combined fault classes
Some design faults affecting software can cause so-called software aging, i.e., progressively accrued error conditions resulting in performance degradation or activation of elusive faults. Examples are memory bloating and leaking, unreleased file-locks, and storage space fragmentation.
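A minimal sketch of such aging (our illustration; the cache, the request handler, and the rejuvenate function are hypothetical) is a memory leak in which state accrues until performance degrades or memory is exhausted; periodically resetting the accrued state, often called software rejuvenation, is a commonly cited proactive countermeasure.

# A minimal sketch of software aging caused by a memory leak.
_request_cache = {}   # accrued error condition: entries are never evicted

def handle_request(request_id: int) -> str:
    """Handle a request, leaking one cache entry per call."""
    result = f"response-{request_id}"
    _request_cache[request_id] = result   # the leak
    return result

def rejuvenate() -> None:
    """Software rejuvenation: proactively reset the accrued state."""
    _request_cache.clear()

for i in range(1000):
    handle_request(i)
print(len(_request_cache))   # grows with every request between rejuvenations
rejuvenate()
print(len(_request_cache))   # back to 0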
The relationship between faults, errors, and failures is addressed in Appendix 1.
Two final comments about the words, or labels, ‘fault’, ‘error’, and ‘failure’:
a) though we have chosen to use just these terms in this document, and to employ adjectives to distinguish different kinds of faults, errors and failures, we recognize the potential convenience of using words that designate, briefly and unambiguously, a specific class of threats; this is especially applicable to faults (e.g., bug, flaw, defect, deficiency, erratum) and to failures (e.g., breakdown, malfunction, denial-of-service);
b) the semantics of the terms fault, error, and failure reflect current usage: i) fault prevention, tolerance, and diagnosis, ii) error detection and correction, iii) failure rate, failure mode.
THE ATTRIBUTES OF DEPENDABILITY
Dependability is an integrative concept that encompasses the following basic attributes: