How Do Systems Fail?

B. O’Halloran, Douglas L. Van Bossuyt
DOI: 10.1109/RAMS48030.2020.9153715
Published in: 2020 Annual Reliability and Maintainability Symposium (RAMS), 2020-01-01
Citations: 2

Abstract

Summary & Conclusions

Modern systems are changing quickly and becoming more complex through increased connectivity, smaller packaging, higher performance requirements, more components, the inclusion of complex software and Artificial Intelligence (AI), and much more. The following are high-level challenges that arise in many modern systems. The first is the distribution of the system, which can be both physical (e.g., power grids) and digital (e.g., air traffic control, transportation networks). In highly distributed systems, vulnerability to the environment becomes significant. The second challenge is the implementation of new technology; examples include driverless vehicles and Boeing’s 787 Dreamliner. Occasionally, new technology does not lend itself well to its intended purpose, as observed with Supersonic Transport (SST) aircraft for commercial flights such as Concorde [1] and the Tupolev Tu-144 [2]. This industry suffered a major crash, Air France Flight 4590, which killed 109 passengers and crew and led to the ultimate demise of the industry [3]. The result of these design challenges is the need for improved methods to identify, assess, and mitigate off-nominal behavior.

While all industries seek to create safe and reliable systems, their failures continue to splash across the news with surprising regularity. The examples are nearly endless. Across 63 years (1957–2019) there have been 402 mission failures in the spaceflight industry, including satellites, manned spacecraft, rockets, etc. As a subset of these missions, the manned spaceflight industry has seen 118 failures with a total of 262 deaths [4]: 5 manned flight incidents in which 19 astronauts died, 8 training or testing incidents in which 11 astronauts died, 35 incidents in which a total of 232 non-astronauts died (e.g., civilians, employees, etc.), and 70 incidents (35 flight and 35 training or testing) in which no deaths occurred. Beyond the 402 mission failures, there have also been 118 satellite launch failures [4]. Since the introduction of the commercial airline industry in 1918, there have been a reported 154,984 deaths [3]. Since 1970, there have been 11,634 accidents. Even more alarming is that the annual number of deaths has not decreased much with time: the average number of deaths per year is 1722 for 1970–2018 and 1337 for 1990–2018. While this figure has fallen, a large number of accidents continue to cause a large number of deaths in this industry. According to [5], there have been 25 major dam failures, 16 of which have occurred in the last 50 years. The nuclear power industry has observed over 100 failures, several of which have resulted in mitigation costs exceeding a billion US dollars.

It is important to note that systems fail with regularity regardless of the system’s type, purpose, or age, the industry to which the system belongs, or the era in which it was designed and built. The continued increase in what we demand from our systems has always outpaced the practitioner’s ability to assess and mitigate off-nominal behavior. These facts show that failure has always been imminent. Until significant improvements are made to the way that we assess and mitigate failures, it is unreasonable to expect the outcome to change. One element of assessment is to understand the variety of causes behind the failures we observe. As such, this paper seeks to characterize failures by their cause. This is done by surveying a large number of failures from several different relevant industries, then deriving categories of failure cause. Seven categories of failures are identified: development failures, induced failures, common cause failures, propagated failures, interaction failures, malicious failures, and management, customer, and misuse failures. By understanding the different classes of failures potentially present in complex systems, engineers can better choose which failure, risk, and reliability analysis tools are most appropriate to use with specific systems. This in turn may lead to more reliable systems that are less prone to failure throughout the system lifecycle.