Title: A paradigm shift is coming - continuous failure
Author: A. Geist
Venue: 2012 International Conference on Collaboration Technologies and Systems (CTS)
Published: 2012-05-21
DOI: 10.1109/CTS.2012.6261077 (https://doi.org/10.1109/CTS.2012.6261077)

Abstract:
Resilience is a measure of the ability of a computing system and its applications to continue working in the presence of system degradations and failures. This talk presents the factors that are driving an exponential increase in system fault rates. At this rate of increase, if the hardware and software are not fault tolerant at exascale, then even relatively short-lived applications are unlikely to finish; worse, they may complete with incorrect results. New paradigms must be developed for handling faults within both the system software and user applications. Also presented are new approaches for integrating detection algorithms in both hardware and software, and new techniques to help simulations adapt to faults.
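
The claim that even relatively short-lived applications are unlikely to finish can be illustrated with a back-of-the-envelope model. The sketch below is not taken from the talk: it assumes independent nodes with exponentially distributed failure times, so that the system mean time between failures (MTBF) is the node MTBF divided by the node count. The function names, the 5-year node MTBF, and the node counts are hypothetical values chosen only for illustration.

```python
import math


def system_mtbf_hours(node_mtbf_hours: float, node_count: int) -> float:
    """System MTBF, assuming independent nodes with exponential failure times."""
    return node_mtbf_hours / node_count


def completion_probability(runtime_hours: float, mtbf_hours: float) -> float:
    """Probability that a run of the given length sees no failure at all."""
    return math.exp(-runtime_hours / mtbf_hours)


if __name__ == "__main__":
    # Hypothetical assumption: each node fails on average once every 5 years.
    node_mtbf = 5 * 365 * 24
    for nodes in (1_000, 100_000, 1_000_000):
        mtbf = system_mtbf_hours(node_mtbf, nodes)
        p = completion_probability(1.0, mtbf)
        print(f"{nodes:>9} nodes: system MTBF {mtbf:.3f} h, "
              f"P(1 h run sees no failure) = {p:.3f}")
```

Under these assumptions the chance of an uninterrupted run falls exponentially with node count, which is why the abstract argues that both system software and applications must tolerate faults rather than rely on failure-free execution. The model ignores silent errors, which correspond to the "complete with incorrect results" case and are not captured by a completion probability.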