Fernando Fernandes dos Santos;Niccolò Cavagnero;Marco Ciccone;Giuseppe Averta;Angeliki Kritikakou;Olivier Sentieys;Paolo Rech;Tatiana Tommasi
{"title":"通过瞬时故障感知设计和训练提高深度神经网络可靠性","authors":"Fernando Fernandes dos Santos;Niccolò Cavagnero;Marco Ciccone;Giuseppe Averta;Angeliki Kritikakou;Olivier Sentieys;Paolo Rech;Tatiana Tommasi","doi":"10.1109/TETC.2024.3520672","DOIUrl":null,"url":null,"abstract":"Deep Neural Networks (DNNs) have revolutionized several fields, including safety- and mission-critical applications, such as autonomous driving and space exploration. However, recent studies have highlighted that transient hardware faults can corrupt the model's output, leading to high misprediction probabilities. Since traditional reliability strategies, based on modular hardware, software replications, or matrix multiplication checksum impose a high overhead, there is a pressing need for efficient and effective hardening solutions tailored for DNNs. In this article we present several network design choices and a training procedure that increase the robustness of standard deep models and thoroughly evaluate these strategies with experimental analyses on vision classification tasks. We name <italic>DieHardNet</i> the specialized DNN obtained by applying all our hardening techniques that combine knowledge from experimental hardware faults characterization and machine learning studies. We conduct extensive ablation studies to quantify the reliability gain of each hardening component in DieHardNet. We perform over 10,000 instruction-level fault injections to validate our approach and expose DieHardNet executed on GPUs to an accelerated neutron beam equivalent to more than 570,000 years of natural radiation. Our evaluation demonstrates that DieHardNet can reduce the critical error rate (i.e., errors that modify the inference) up to 100 times compared to the unprotected baseline model, without causing any increase in inference time.","PeriodicalId":13156,"journal":{"name":"IEEE Transactions on Emerging Topics in Computing","volume":"13 3","pages":"829-840"},"PeriodicalIF":5.4000,"publicationDate":"2025-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Improving Deep Neural Network Reliability via Transient-Fault-Aware Design and Training\",\"authors\":\"Fernando Fernandes dos Santos;Niccolò Cavagnero;Marco Ciccone;Giuseppe Averta;Angeliki Kritikakou;Olivier Sentieys;Paolo Rech;Tatiana Tommasi\",\"doi\":\"10.1109/TETC.2024.3520672\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Deep Neural Networks (DNNs) have revolutionized several fields, including safety- and mission-critical applications, such as autonomous driving and space exploration. However, recent studies have highlighted that transient hardware faults can corrupt the model's output, leading to high misprediction probabilities. Since traditional reliability strategies, based on modular hardware, software replications, or matrix multiplication checksum impose a high overhead, there is a pressing need for efficient and effective hardening solutions tailored for DNNs. In this article we present several network design choices and a training procedure that increase the robustness of standard deep models and thoroughly evaluate these strategies with experimental analyses on vision classification tasks. We name <italic>DieHardNet</i> the specialized DNN obtained by applying all our hardening techniques that combine knowledge from experimental hardware faults characterization and machine learning studies. 
We conduct extensive ablation studies to quantify the reliability gain of each hardening component in DieHardNet. We perform over 10,000 instruction-level fault injections to validate our approach and expose DieHardNet executed on GPUs to an accelerated neutron beam equivalent to more than 570,000 years of natural radiation. Our evaluation demonstrates that DieHardNet can reduce the critical error rate (i.e., errors that modify the inference) up to 100 times compared to the unprotected baseline model, without causing any increase in inference time.\",\"PeriodicalId\":13156,\"journal\":{\"name\":\"IEEE Transactions on Emerging Topics in Computing\",\"volume\":\"13 3\",\"pages\":\"829-840\"},\"PeriodicalIF\":5.4000,\"publicationDate\":\"2025-01-10\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Emerging Topics in Computing\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10836186/\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Emerging Topics in Computing","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10836186/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Improving Deep Neural Network Reliability via Transient-Fault-Aware Design and Training
Deep Neural Networks (DNNs) have revolutionized several fields, including safety- and mission-critical applications such as autonomous driving and space exploration. However, recent studies have highlighted that transient hardware faults can corrupt a model's output, leading to high misprediction probabilities. Since traditional reliability strategies based on modular hardware, software replication, or matrix multiplication checksums impose a high overhead, there is a pressing need for efficient and effective hardening solutions tailored to DNNs. In this article we present several network design choices and a training procedure that increase the robustness of standard deep models, and we thoroughly evaluate these strategies with experimental analyses on vision classification tasks. We name DieHardNet the specialized DNN obtained by applying all of our hardening techniques, which combine knowledge from experimental hardware fault characterization and machine learning studies. We conduct extensive ablation studies to quantify the reliability gain of each hardening component in DieHardNet. To validate our approach, we perform over 10,000 instruction-level fault injections and expose DieHardNet executed on GPUs to an accelerated neutron beam equivalent to more than 570,000 years of natural radiation. Our evaluation demonstrates that, compared to the unprotected baseline model, DieHardNet can reduce the critical error rate (i.e., errors that change the inference outcome) by up to 100 times without any increase in inference time.
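To make the overhead argument concrete, the sketch below (not taken from the paper; the function name and tolerances are illustrative assumptions) shows the kind of algorithm-based fault tolerance (ABFT) checksum for matrix multiplication that the abstract cites as a traditional protection:

import numpy as np

def checksum_matmul(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Compute C = A @ B and verify it with an ABFT-style column checksum.

    Because (1^T A) B == 1^T (A B), the column sums of C can be recomputed
    independently; a transient fault that corrupts C makes the two disagree.
    """
    C = A @ B
    independent = A.sum(axis=0) @ B   # (1^T A) B, computed outside the result
    from_result = C.sum(axis=0)       # 1^T (A B), read back from the result
    if not np.allclose(independent, from_result, rtol=1e-5, atol=1e-6):
        raise RuntimeError("checksum mismatch: possible transient fault in C")
    return C

# Usage: silently flipping one element of C would trip the check.
A = np.random.randn(64, 128)
B = np.random.randn(128, 32)
C = checksum_matmul(A, B)

Note that such a check only detects (it does not correct) corruption and adds an extra matrix-vector product per multiplication; this per-operation cost is the overhead that motivates hardening the network through design and training instead.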
Journal introduction:
IEEE Transactions on Emerging Topics in Computing publishes papers on emerging aspects of computer science, computing technology, and computing applications not currently covered by other IEEE Computer Society Transactions. Some examples of emerging topics in computing include: IT for Green, Synthetic and organic computing structures and systems, Advanced analytics, Social/occupational computing, Location-based/client computer systems, Morphic computer design, Electronic game systems, & Health-care IT.