How Do Systems Fail?
B. O’Halloran, Douglas L. Van Bossuyt
2020 Annual Reliability and Maintainability Symposium (RAMS), 2020
DOI: https://doi.org/10.1109/RAMS48030.2020.9153715
Citations: 2
Abstract
Summary & Conclusions
Modern systems are changing quickly and becoming more complex through increased connectivity, smaller packaging, higher performance requirements, more components, the inclusion of complex software and Artificial Intelligence (AI), and much more. The following are high-level challenges that arise in many modern systems. The first is the distribution of the system, which is both physical (e.g., power grids) and digital (e.g., air traffic control, transportation networks). With highly distributed systems, vulnerability to the environment becomes significant. The second challenge is the implementation of new technology; examples include driverless vehicles and Boeing’s 787 Dreamliner. Occasionally, new technology does not lend itself well to its intended purpose, as observed with supersonic transport (SST) aircraft for commercial flights such as Concorde [1] and the Tupolev Tu-144 [2]. This industry suffered a major crash, Air France Flight 4590, which killed 109 passengers and crew and led to the ultimate demise of the industry [3]. The result of these design challenges is the need for improved methods to identify, assess, and mitigate off-nominal behavior. While all industries seek to create safe and reliable systems, their failures continue to splash across the news with surprising regularity. The examples are nearly endless. Across 63 years (1957–2019) there have been 402 mission failures in the spaceflight industry, including satellites, manned spacecraft, rockets, etc. As a subset of these missions, the manned spaceflight industry has seen 118 failures with a total of 262 deaths [4]: 5 manned flight incidents in which 19 astronauts died, 8 training or testing incidents in which 11 astronauts died, 35 incidents in which a total of 232 non-astronauts died (e.g., civilians, employees), and 70 incidents (35 flight and 35 training or testing) in which no deaths occurred.
Beyond the 402 mission failures, there have also been 118 satellite launch failures [4]. Since the introduction of the commercial airline industry in 1918, there have been a reported 154,984 deaths [3]. Since 1970, there have been 11,634 accidents. Even more alarming is that the annual death rate has not decreased much with time: it is 1722 deaths per year for 1970–2018 and 1337 deaths per year for 1990–2018. While this rate has fallen, a large number of accidents continue to cause a large number of deaths in this industry. According to [5], there have been 25 major dam failures, 16 of which occurred in the last 50 years. The nuclear power industry has observed over 100 failures, several of which have resulted in mitigations exceeding a billion US dollars. It is important to note that systems fail with regularity regardless of the system’s type, purpose, or age, the industry to which the system belongs, or the era in which it was designed and built. The continued increase in what we demand from our systems has always outpaced the practitioner’s ability to assess and mitigate off-nominal behavior. These facts show that failure has always been imminent. Until significant improvements are made to the way that we assess and mitigate failures, it is unreasonable to expect the outcome to change. One element of assessment is to understand the variety of causes behind the failures we observe. As such, this paper seeks to characterize failures by their cause. This is done by surveying a large number of failures from several relevant industries, then deriving categories of failure cause. Seven categories of failure are identified: development failures, induced failures, common cause failures, propagated failures, interaction failures, malicious failures, and management, customer, and misuse failures.
By understanding the different classes of failures potentially present in complex systems, engineers can better choose which failure, risk, and reliability analysis tools are most appropriate to use with specific systems. This in turn may lead to more reliable systems that are less prone to failure throughout the system lifecycle.
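
As a quick sanity check on the manned-spaceflight figures quoted in the abstract, the per-category incident and death counts can be tallied against the stated totals of 118 failures and 262 deaths. The following is only a minimal sketch: the numbers are copied from the abstract, while the variable names and category labels are illustrative and not taken from the paper.

```python
# Manned-spaceflight breakdown as quoted in the abstract (citing [4]).
# Each entry: (illustrative label, number of incidents, number of deaths).
breakdown = [
    ("manned flight incidents with astronaut deaths", 5, 19),
    ("training or testing incidents with astronaut deaths", 8, 11),
    ("incidents with non-astronaut deaths", 35, 232),
    ("incidents with no deaths (35 flight + 35 training or testing)", 70, 0),
]

# Sum the incident and death columns separately.
total_incidents = sum(incidents for _, incidents, _ in breakdown)
total_deaths = sum(deaths for _, _, deaths in breakdown)

print(total_incidents, total_deaths)  # prints "118 262"
```

The tallies match the totals stated in the abstract, so the breakdown is internally consistent.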