A Comprehensive Soft Error Resiliency Analysis of Distributed Deep Neural Networks

IF 1.5 · Zone 4 (Computer Science) · JCR Q3 (Computer Science, Software Engineering)

Concurrency and Computation-Practice & Experience · Pub Date: 2025-09-03 · DOI: 10.1002/cpe.70259

Setareh Ahsaei, Mohsen Raji, Maryam Asadi Golmankhaneh

Volume 37, Issue 23-24 · Journal Article · https://onlinelibrary.wiley.com/doi/10.1002/cpe.70259


Abstract

Distributed deep neural networks (DDNNs) have emerged as a promising way to improve the efficiency of deep learning tasks: rather than running the whole model in a centralized cloud, as traditional deep neural networks (DNNs) do, they distribute the computational workload across cloud, fog, and edge nodes. Soft errors that alter model parameters are known to degrade the performance and reliability of DNNs considerably, yet the resiliency of DDNNs against these effects remains understudied. This paper presents a comprehensive analysis of the error resiliency of DDNNs, focusing on the impact of soft errors at various network layers. Using Docker containers to emulate real-world scenarios, the study evaluates SqueezeNet and MobileNetV2 models trained on the CIFAR-100 and CIFAR-10 datasets under varying bit error rates (BERs). The results show that, up to a certain BER, errors introduce uncertainty at the edge node of the DDNN; beyond this threshold, the edge node is significantly compromised by faults and becomes highly likely to make false decisions. Growing uncertainty shifts decision-making to the fog and cloud nodes, which considerably increases response time. These insights deepen our understanding of fault tolerance in DDNNs and lay the groundwork for more resilient and efficient distributed learning architectures. The Docker-based emulation provides a flexible and reproducible experimental framework that can be adapted for further studies in this area. The findings also highlight the need for adaptive strategies that intelligently manage errors and computational resources across the cloud, fog, and edge layers. These results are particularly relevant to time-sensitive applications such as autonomous vehicles, industrial IoT systems, and smart city infrastructures, where the reliability and speed of DDNNs are critical.
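The two mechanisms the abstract describes, bit flips in model parameters at a given BER and uncertainty pushing decisions from the edge toward the fog and cloud, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes float32 weights, independent flips of each stored bit, and a normalized-entropy exit rule. The abstract confirms none of these, and all function names and the threshold are hypothetical.

```python
import math
import random
import struct


def flip_bit(value: float, bit: int) -> float:
    """Flip one bit (0 = LSB, 31 = sign) of the IEEE-754 float32 encoding of value."""
    (as_int,) = struct.unpack("<I", struct.pack("<f", value))
    (flipped,) = struct.unpack("<f", struct.pack("<I", as_int ^ (1 << bit)))
    return flipped


def inject_soft_errors(weights, ber, rng):
    """Return a copy of weights in which each of the 32 bits of every
    weight flips independently with probability ber (assumed fault model)."""
    corrupted = []
    for w in weights:
        for bit in range(32):
            if rng.random() < ber:
                w = flip_bit(w, bit)
        corrupted.append(w)
    return corrupted


def normalized_entropy(probs):
    """Shannon entropy of a class-probability vector, scaled to [0, 1]."""
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(probs))


def choose_exit(tier_probs, threshold):
    """Walk the tiers in order (edge, fog, cloud) and return the first one
    whose output entropy is at or below threshold; the last tier always answers.
    The entropy rule is an assumption, not taken from the paper."""
    names = list(tier_probs)
    for name in names[:-1]:
        if normalized_entropy(tier_probs[name]) <= threshold:
            return name
    return names[-1]


if __name__ == "__main__":
    rng = random.Random(0)
    weights = [0.5, -0.25, 0.125, 1.0]
    print(inject_soft_errors(weights, ber=1e-2, rng=rng))

    # A confident edge output exits locally; an uncertain one escalates.
    confident = {"edge": [0.97, 0.01, 0.01, 0.01],
                 "fog": [0.25] * 4, "cloud": [0.25] * 4}
    uncertain = {"edge": [0.25] * 4,
                 "fog": [0.9, 0.05, 0.03, 0.02], "cloud": [0.25] * 4}
    print(choose_exit(confident, threshold=0.5))
    print(choose_exit(uncertain, threshold=0.5))
```

Under this model a single flip in a high exponent bit can turn a weight of 1.0 into infinity, whereas a flip in a low mantissa bit barely changes it. That gives one plausible reason why the edge output could move from merely uncertain to outright wrong as the BER grows.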




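Source journal

Concurrency and Computation-Practice & Experience (Engineering and Technology: Computer Science, Theory and Methods)

CiteScore: 5.00

Self-citation rate: 10.00%

Articles published: 664

Review time: 9.6 months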

About the journal: Concurrency and Computation: Practice and Experience (CCPE) publishes high-quality, original research papers and authoritative research review papers in the overlapping fields of parallel and distributed computing; high-performance computing; computational and data science; artificial intelligence and machine learning; big data applications, algorithms, and systems; network science; ontologies and semantics; security and privacy; cloud/edge/fog computing; green computing; and quantum computing.