{"title":"科学计算中错误恢复的算法选择","authors":"Joseph Callenes-Sloan, H. McNamara","doi":"10.1109/PRDC.2014.20","DOIUrl":null,"url":null,"abstract":"With process scaling and the adoption of post-cmos technologies, reliability and power are becoming a significant concern for future computing systems, especially highly parallel systems. Previous approaches have investigated augmenting applications with additional logic to detect and correct errors efficiently. In this research, we investigate the impact of different algorithmic designs on error resilience and propose an approach for algorithm selection for a class of equations, i.e. partial differential equations (PDEs), that are at the core of many scientific computing applications, which drive HPC systems. Many different schemes have been devised for the approximation of PDE systems, each with different accuracy, stability, and performance properties. In this research, there are two primary questions that we address: (1) Does numerical stability translate to error resilience? and (2) How do we design schemes to improve error resilience? If an algorithm's error resilience is correlated with its numerical stability properties, this may allow us to design more resilient applications by leveraging well established information on numerical stability. Even with a clear translation of numerical stability to error resilience properties, the question of designing these algorithms still remains however, due to the variety of implementations, schemes, and largely input specific nature of the design. In this research, we propose one approach for automated design using machine-learning. We observe that intelligent selection of the algorithm or a given problem, improves robustness by 20%-50%, on average, over the traditional selection of algorithms, without the addition of any other detection/correction logic.","PeriodicalId":187000,"journal":{"name":"2014 IEEE 20th Pacific Rim International Symposium on Dependable Computing","volume":"18 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2014-11-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"Algorithm Selection for Error Resilience in Scientific Computing\",\"authors\":\"Joseph Callenes-Sloan, H. McNamara\",\"doi\":\"10.1109/PRDC.2014.20\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"With process scaling and the adoption of post-cmos technologies, reliability and power are becoming a significant concern for future computing systems, especially highly parallel systems. Previous approaches have investigated augmenting applications with additional logic to detect and correct errors efficiently. In this research, we investigate the impact of different algorithmic designs on error resilience and propose an approach for algorithm selection for a class of equations, i.e. partial differential equations (PDEs), that are at the core of many scientific computing applications, which drive HPC systems. Many different schemes have been devised for the approximation of PDE systems, each with different accuracy, stability, and performance properties. In this research, there are two primary questions that we address: (1) Does numerical stability translate to error resilience? and (2) How do we design schemes to improve error resilience? If an algorithm's error resilience is correlated with its numerical stability properties, this may allow us to design more resilient applications by leveraging well established information on numerical stability. Even with a clear translation of numerical stability to error resilience properties, the question of designing these algorithms still remains however, due to the variety of implementations, schemes, and largely input specific nature of the design. In this research, we propose one approach for automated design using machine-learning. We observe that intelligent selection of the algorithm or a given problem, improves robustness by 20%-50%, on average, over the traditional selection of algorithms, without the addition of any other detection/correction logic.\",\"PeriodicalId\":187000,\"journal\":{\"name\":\"2014 IEEE 20th Pacific Rim International Symposium on Dependable Computing\",\"volume\":\"18 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2014-11-18\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2014 IEEE 20th Pacific Rim International Symposium on Dependable Computing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/PRDC.2014.20\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2014 IEEE 20th Pacific Rim International Symposium on Dependable Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/PRDC.2014.20","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Algorithm Selection for Error Resilience in Scientific Computing
With process scaling and the adoption of post-cmos technologies, reliability and power are becoming a significant concern for future computing systems, especially highly parallel systems. Previous approaches have investigated augmenting applications with additional logic to detect and correct errors efficiently. In this research, we investigate the impact of different algorithmic designs on error resilience and propose an approach for algorithm selection for a class of equations, i.e. partial differential equations (PDEs), that are at the core of many scientific computing applications, which drive HPC systems. Many different schemes have been devised for the approximation of PDE systems, each with different accuracy, stability, and performance properties. In this research, there are two primary questions that we address: (1) Does numerical stability translate to error resilience? and (2) How do we design schemes to improve error resilience? If an algorithm's error resilience is correlated with its numerical stability properties, this may allow us to design more resilient applications by leveraging well established information on numerical stability. Even with a clear translation of numerical stability to error resilience properties, the question of designing these algorithms still remains however, due to the variety of implementations, schemes, and largely input specific nature of the design. In this research, we propose one approach for automated design using machine-learning. We observe that intelligent selection of the algorithm or a given problem, improves robustness by 20%-50%, on average, over the traditional selection of algorithms, without the addition of any other detection/correction logic.