Data Flow Error Recovery with Checkpointing and Instruction-Level Fault Tolerance
Lei Xiong, QingPing Tan
2011 12th International Conference on Parallel and Distributed Computing, Applications and Technologies, 2011-10-20
DOI: 10.1109/PDCAT.2011.33 (https://doi.org/10.1109/PDCAT.2011.33)
Soft error detection and recovery are important to system reliability, especially as fabrication technology advances. Instruction-level soft error tolerance methods, which need no additional hardware, have been widely discussed. This paper proposes an application-level data flow error recovery approach that combines checkpointing with instruction-level fault tolerance. At the instruction level, code is divided into protected and unprotected code based on its sensitivity to hardware soft errors. In protected code, every data value is kept in two copies. At certain program points, such as store and branch instructions, the related data are checked. If the two copies are not identical, a soft error is assumed to have occurred, and the program state is restored from a prior checkpoint related to the erroneous data. For each checked data value, the related checkpoint is saved based on a program slice computed over the program from its beginning up to the checked data. Finally, the approach is implemented, and experimental results demonstrate the approach.
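The duplicate, compare, and roll back cycle described in the abstract can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the authors' implementation: the names `check`, `run_protected`, and `inject_fault_at` are hypothetical, the "protected code" is a simple running sum, the fault is a simulated single bit flip, and the checkpoint simply copies the few variables the checked value depends on rather than deriving them from a real program slice.

```python
import copy


class SoftErrorDetected(Exception):
    """Raised when the two copies of a protected value disagree."""


def check(primary, shadow):
    # Comparison point (the abstract places these at stores and branches).
    if primary != shadow:
        raise SoftErrorDetected(f"copies differ: {primary!r} vs {shadow!r}")
    return primary


def run_protected(values, inject_fault_at=None, max_retries=3):
    """Compute running sums of `values` with duplicated state.

    Returns (stored_results, number_of_rollbacks).
    `inject_fault_at` is a hypothetical test hook: the index at which one
    transient bit flip corrupts the primary copy.
    """
    memory = []  # values committed by the simulated 'store'
    # Assumption: these three variables stand in for the slice of program
    # state that the checked value depends on.
    state = {"i": 0, "acc": 0, "acc_shadow": 0}
    checkpoint = copy.deepcopy(state)
    retries = 0

    while state["i"] < len(values):
        v = values[state["i"]]
        # Duplicated computation: the same operation on both copies.
        state["acc"] += v
        state["acc_shadow"] += v

        if inject_fault_at == state["i"]:
            state["acc"] ^= 1 << 3   # simulated single bit flip
            inject_fault_at = None   # transient: does not recur on replay

        try:
            stored = check(state["acc"], state["acc_shadow"])
        except SoftErrorDetected:
            retries += 1
            if retries > max_retries:
                raise
            # Roll back to the last checkpoint and re-execute this step.
            state = copy.deepcopy(checkpoint)
            continue

        memory.append(stored)        # only checked data is stored
        state["i"] += 1
        checkpoint = copy.deepcopy(state)  # checkpoint after a clean check

    return memory, retries
```

Calling `run_protected([1, 2, 3, 4], inject_fault_at=2)` triggers one mismatch at the third element, restores the checkpoint taken after the second store, and replays the step, so the stored results match those of a fault-free run. Taking a checkpoint after every clean check is a simplification chosen for brevity; the abstract ties each checkpoint to the program slice of the checked data, which this sketch does not model.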