Dynamic Task Replication With Imperfect Fault Detection in Multicore Cyber-Physical Systems
Authors: Hossein Hosseini; Mohsen Ansari; Jörg Henkel
Journal: IEEE Transactions on Emerging Topics in Computing, vol. 13, no. 3, pp. 1113-1129
Publication date: 2025-03-28
DOI: 10.1109/TETC.2025.3572277
URL: https://ieeexplore.ieee.org/document/11017438/
Impact factor: 5.4; JCR: Q1 (Computer Science, Information Systems)
Task replication is a common technique for achieving fault tolerance. However, its effectiveness is limited by the accuracy of the fault detection mechanism; imperfect detection imposes a ceiling on achievable reliability. While perfect fault detection mechanisms offer higher reliability, they introduce significant overhead. To address this, we introduce Dynamic Task Replication, a fault tolerance technique that dynamically determines the number of replicas at runtime to overcome the limitations of imperfect fault detection. Our primary contribution, Reliability-Aware Replica-Efficient Dynamic Task Replication, optimizes this approach by minimizing the expected number of replicas while achieving the desired reliability target. We incorporate actual execution times into the reliability assessment. Additionally, we propose the Energy-Aware Reliability-Guaranteeing scheduling technique, which integrates our optimized replication method into hard real-time systems and leverages Dynamic Voltage and Frequency Scaling to minimize energy consumption while ensuring reliability and system schedulability. Experimental results demonstrate that our method requires 24% fewer replicas on average than the N-Modular Redundancy technique, with the advantage increasing to 58% for tasks with low base reliabilities. Furthermore, our scheduling technique significantly conserves energy and enhances feasibility compared to existing methods across diverse system workloads.
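The claim that imperfect detection caps achievable reliability can be illustrated with a small calculation. The sketch below uses a simplified model that is an assumption for illustration, not the model from the paper: each replica is independently correct with probability `r`, a faulty result is flagged with probability `c` (detection coverage), correct results are never flagged, and the first unflagged result is accepted. The function names are hypothetical.

```python
from typing import Optional


def accepted_correct_probability(r: float, c: float, n: int) -> float:
    """Probability that a correct result is accepted within n sequential replicas.

    Assumed model (illustrative only): a replica is correct with probability r;
    a faulty result is detected with probability c; an undetected faulty result
    is accepted and counts as a failure.
    """
    if not (0.0 < r <= 1.0 and 0.0 <= c <= 1.0 and n >= 1):
        raise ValueError("need 0 < r <= 1, 0 <= c <= 1, n >= 1")
    retry = (1.0 - r) * c  # faulty AND detected, so another replica runs
    # Geometric sum: r * (1 + retry + ... + retry**(n - 1))
    return r * (1.0 - retry ** n) / (1.0 - retry)


def reliability_ceiling(r: float, c: float) -> float:
    """Limit of accepted_correct_probability as n grows without bound."""
    return r / (1.0 - (1.0 - r) * c)


def min_replicas(r: float, c: float, target: float, max_n: int = 64) -> Optional[int]:
    """Smallest n that meets the target, or None if it is out of reach."""
    if target > reliability_ceiling(r, c):
        return None  # no replica count helps under this model
    for n in range(1, max_n + 1):
        if accepted_correct_probability(r, c, n) >= target:
            return n
    return None


if __name__ == "__main__":
    # With 90% coverage the ceiling sits just below 0.99, so 0.999 is unreachable.
    print(reliability_ceiling(0.9, 0.9))
    print(min_replicas(0.9, 0.9, 0.98))
    print(min_replicas(0.9, 0.9, 0.999))
    # Perfect detection (c = 1.0) removes the ceiling in this model.
    print(min_replicas(0.9, 1.0, 0.9985))
```

Under these assumptions, adding replicas to a fixed retry scheme only approaches `r / (1 - (1 - r) * c)`. This is the kind of limit the abstract describes, and it is why a different replication strategy is needed when the target lies above it.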
About the journal:
IEEE Transactions on Emerging Topics in Computing publishes papers on emerging aspects of computer science, computing technology, and computing applications not currently covered by other IEEE Computer Society Transactions. Some examples of emerging topics in computing include: IT for Green, Synthetic and organic computing structures and systems, Advanced analytics, Social/occupational computing, Location-based/client computer systems, Morphic computer design, Electronic game systems, & Health-care IT.