FT-DeepNets: Fault-Tolerant Convolutional Neural Networks with Kernel-based Duplication

Iljoo Baek, Wei Chen, Zhihao Zhu, Soheil Samii, R. Rajkumar
Published in: 2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), January 2022
DOI: 10.1109/WACV51458.2022.00194
Citations: 2

Abstract

Deep neural network (deepnet) applications play a crucial role in safety-critical systems such as autonomous vehicles (AVs). An AV must drive safely towards its destination, avoid obstacles, and respond quickly when the vehicle must stop. Any transient errors in software calculations or hardware memory in these deepnet applications can potentially lead to dramatically incorrect results. Therefore, assessing and mitigating any transient errors and providing robust results are important for safety-critical systems. Previous research on this subject focused on detecting errors and then recovering from them by re-running the network. Other approaches relied on some degree of full network duplication, such as ensemble learning-based approaches that boost system fault-tolerance by leveraging each model's advantages. However, it is hard to detect errors in a deep neural network, and the computational overhead of full redundancy can be substantial. We first study the impact of error types and locations in deepnets. We next focus on selecting which parts should be duplicated, using multiple ranking methods to measure the order of importance among neurons. We find that the duplication overhead for computation and memory is a trade-off between algorithmic performance and robustness. To achieve higher robustness with less system overhead, we present two error protection mechanisms that duplicate only the parts of the network derived from critical neurons. Finally, we substantiate the practical feasibility of our approach and evaluate the improvement in the accuracy of a deepnet in the presence of errors. We demonstrate these results using a case study with real-world applications on an Nvidia GeForce RTX 2070Ti GPU and an Nvidia Xavier embedded platform used by automotive OEMs.
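The selective-duplication idea in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes an L1-norm importance metric (the paper evaluates multiple ranking methods without naming them here) and a simple run-twice-and-compare check as one way to protect critical kernels against transient errors; `compute_channel` is a hypothetical stand-in for a per-output-channel convolution.

```python
import numpy as np

def rank_kernels(weights):
    """Rank a conv layer's output-channel kernels by descending L1 norm.

    weights: array of shape (out_channels, in_channels, kH, kW).
    Returns kernel indices, most important first. (L1 norm is one
    plausible importance metric; the paper compares several.)
    """
    scores = np.abs(weights).reshape(weights.shape[0], -1).sum(axis=1)
    return np.argsort(-scores)

def protected_channel_outputs(compute_channel, weights, top_k):
    """Duplicate only the top-k critical kernels' computations.

    compute_channel(idx) returns channel idx's feature map. Critical
    channels are computed twice; a mismatch signals a transient error,
    which we recover from by recomputing once. Non-critical channels
    run unprotected, keeping the redundancy overhead partial.
    """
    critical = set(rank_kernels(weights)[:top_k].tolist())
    outputs = []
    for idx in range(weights.shape[0]):
        out = compute_channel(idx)
        if idx in critical:
            check = compute_channel(idx)        # duplicated execution
            if not np.array_equal(out, check):  # transient error detected
                out = compute_channel(idx)      # recover by recomputing
        outputs.append(out)
    return outputs
```

The key trade-off the abstract describes is visible in `top_k`: raising it protects more kernels at a proportional cost in extra computation and memory, while full duplication corresponds to `top_k = out_channels`.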