Fernando Fernandes dos Santos;Niccolò Cavagnero;Marco Ciccone;Giuseppe Averta;Angeliki Kritikakou;Olivier Sentieys;Paolo Rech;Tatiana Tommasi
{"title":"通过瞬时故障感知设计和训练提高深度神经网络可靠性","authors":"Fernando Fernandes dos Santos;Niccolò Cavagnero;Marco Ciccone;Giuseppe Averta;Angeliki Kritikakou;Olivier Sentieys;Paolo Rech;Tatiana Tommasi","doi":"10.1109/TETC.2024.3520672","DOIUrl":null,"url":null,"abstract":"Deep Neural Networks (DNNs) have revolutionized several fields, including safety- and mission-critical applications, such as autonomous driving and space exploration. However, recent studies have highlighted that transient hardware faults can corrupt the model's output, leading to high misprediction probabilities. Since traditional reliability strategies, based on modular hardware, software replications, or matrix multiplication checksum impose a high overhead, there is a pressing need for efficient and effective hardening solutions tailored for DNNs. In this article we present several network design choices and a training procedure that increase the robustness of standard deep models and thoroughly evaluate these strategies with experimental analyses on vision classification tasks. We name <italic>DieHardNet</i> the specialized DNN obtained by applying all our hardening techniques that combine knowledge from experimental hardware faults characterization and machine learning studies. We conduct extensive ablation studies to quantify the reliability gain of each hardening component in DieHardNet. We perform over 10,000 instruction-level fault injections to validate our approach and expose DieHardNet executed on GPUs to an accelerated neutron beam equivalent to more than 570,000 years of natural radiation. Our evaluation demonstrates that DieHardNet can reduce the critical error rate (i.e., errors that modify the inference) up to 100 times compared to the unprotected baseline model, without causing any increase in inference time.","PeriodicalId":13156,"journal":{"name":"IEEE Transactions on Emerging Topics in Computing","volume":"13 3","pages":"829-840"},"PeriodicalIF":5.4000,"publicationDate":"2025-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Improving Deep Neural Network Reliability via Transient-Fault-Aware Design and Training\",\"authors\":\"Fernando Fernandes dos Santos;Niccolò Cavagnero;Marco Ciccone;Giuseppe Averta;Angeliki Kritikakou;Olivier Sentieys;Paolo Rech;Tatiana Tommasi\",\"doi\":\"10.1109/TETC.2024.3520672\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Deep Neural Networks (DNNs) have revolutionized several fields, including safety- and mission-critical applications, such as autonomous driving and space exploration. However, recent studies have highlighted that transient hardware faults can corrupt the model's output, leading to high misprediction probabilities. Since traditional reliability strategies, based on modular hardware, software replications, or matrix multiplication checksum impose a high overhead, there is a pressing need for efficient and effective hardening solutions tailored for DNNs. In this article we present several network design choices and a training procedure that increase the robustness of standard deep models and thoroughly evaluate these strategies with experimental analyses on vision classification tasks. We name <italic>DieHardNet</i> the specialized DNN obtained by applying all our hardening techniques that combine knowledge from experimental hardware faults characterization and machine learning studies. 
We conduct extensive ablation studies to quantify the reliability gain of each hardening component in DieHardNet. We perform over 10,000 instruction-level fault injections to validate our approach and expose DieHardNet executed on GPUs to an accelerated neutron beam equivalent to more than 570,000 years of natural radiation. Our evaluation demonstrates that DieHardNet can reduce the critical error rate (i.e., errors that modify the inference) up to 100 times compared to the unprotected baseline model, without causing any increase in inference time.\",\"PeriodicalId\":13156,\"journal\":{\"name\":\"IEEE Transactions on Emerging Topics in Computing\",\"volume\":\"13 3\",\"pages\":\"829-840\"},\"PeriodicalIF\":5.4000,\"publicationDate\":\"2025-01-10\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Emerging Topics in Computing\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10836186/\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Emerging Topics in Computing","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10836186/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Improving Deep Neural Network Reliability via Transient-Fault-Aware Design and Training
Deep Neural Networks (DNNs) have revolutionized several fields, including safety- and mission-critical applications such as autonomous driving and space exploration. However, recent studies have highlighted that transient hardware faults can corrupt a model's output, leading to high misprediction probabilities. Since traditional reliability strategies based on modular hardware, software replication, or matrix multiplication checksums impose a high overhead, there is a pressing need for efficient and effective hardening solutions tailored to DNNs. In this article we present several network design choices and a training procedure that increase the robustness of standard deep models, and we thoroughly evaluate these strategies with experimental analyses on vision classification tasks. We name DieHardNet the specialized DNN obtained by applying all of our hardening techniques, which combine knowledge from experimental hardware fault characterization and machine learning studies. We conduct extensive ablation studies to quantify the reliability gain of each hardening component in DieHardNet. To validate our approach, we perform over 10,000 instruction-level fault injections and expose DieHardNet executed on GPUs to an accelerated neutron beam equivalent to more than 570,000 years of natural radiation. Our evaluation demonstrates that, compared to the unprotected baseline model, DieHardNet can reduce the critical error rate (i.e., errors that change the inference outcome) by up to 100 times without any increase in inference time.
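To make the overhead argument concrete, the sketch below (not taken from the paper; the function name and tolerances are illustrative assumptions) shows the kind of algorithm-based fault tolerance (ABFT) checksum for matrix multiplication that the abstract cites as a traditional protection:

import numpy as np

def checksum_matmul(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Compute C = A @ B and verify it with an ABFT-style column checksum.

    Because (1^T A) B == 1^T (A B), the column sums of C can be recomputed
    independently; a transient fault that corrupts C makes the two disagree.
    """
    C = A @ B
    independent = A.sum(axis=0) @ B   # (1^T A) B, computed outside the result
    from_result = C.sum(axis=0)       # 1^T (A B), read back from the result
    if not np.allclose(independent, from_result, rtol=1e-5, atol=1e-6):
        raise RuntimeError("checksum mismatch: possible transient fault in C")
    return C

# Usage: silently flipping one element of C would trip the check.
A = np.random.randn(64, 128)
B = np.random.randn(128, 32)
C = checksum_matmul(A, B)

Note that such a check only detects (it does not correct) corruption and adds an extra matrix-vector product per multiplication; this per-operation cost is the overhead that motivates hardening the network through design and training instead.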
Journal introduction:
IEEE Transactions on Emerging Topics in Computing publishes papers on emerging aspects of computer science, computing technology, and computing applications not currently covered by other IEEE Computer Society Transactions. Some examples of emerging topics in computing include: IT for Green, Synthetic and organic computing structures and systems, Advanced analytics, Social/occupational computing, Location-based/client computer systems, Morphic computer design, Electronic game systems, & Health-care IT.