A Comprehensive Soft Error Resiliency Analysis of Distributed Deep Neural Networks

IF 1.5 | Zone 4, Computer Science | JCR Q3, COMPUTER SCIENCE, SOFTWARE ENGINEERING
Setareh Ahsaei, Mohsen Raji, Maryam Asadi Golmankhaneh
DOI: 10.1002/cpe.70259
Journal: Concurrency and Computation-Practice & Experience, Volume 37, Issue 23-24
Publication date: 2025-09-03
Publication type: Journal Article
URL: https://onlinelibrary.wiley.com/doi/10.1002/cpe.70259
Citations: 0

Abstract

Distributed deep neural networks (DDNNs) have emerged as a promising way to make deep learning tasks more efficient than traditional centralized, cloud-based deep neural networks (DNNs) by distributing the computational workload across cloud, fog, and edge nodes. Although model parameter changes caused by soft errors are known to degrade the performance and reliability of DNNs considerably, the resiliency of DDNNs against these effects remains understudied. This paper presents a comprehensive analysis of the error resiliency of DDNNs, focusing on the impact of soft errors at various network layers. Using Docker containers to emulate real-world scenarios, the study evaluates SqueezeNet and MobileNetV2 models trained on the CIFAR-100 and CIFAR-10 datasets under varying bit error rates (BERs). The results show that, up to a certain BER, errors introduce uncertainty in the edge node of a DDNN; beyond this BER threshold, the edge node is significantly compromised by faults, leading to a high likelihood of false decisions. Increasing uncertainty shifts decision-making to the fog and cloud nodes, which considerably increases response time. These insights deepen our understanding of fault tolerance in DDNNs and lay the groundwork for more resilient and efficient distributed learning architectures. The Docker-based emulation provides a flexible and reproducible experimental framework that can be adapted for further studies in this area. The findings also highlight the need for adaptive strategies that intelligently manage errors and computational resources across the cloud, fog, and edge layers. These results are particularly relevant for time-sensitive applications such as autonomous vehicles, industrial IoT systems, and smart city infrastructures, where the reliability and speed of DDNNs are critical.
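
The abstract does not spell out the fault model or the rule by which the edge node hands a decision to the fog or cloud, so the following is only a minimal sketch of how the two ideas might be emulated. It assumes float32 parameters, independent bit flips occurring with probability equal to the BER, and a normalized-entropy confidence threshold at the edge exit. The function names (`flip_bits_float32`, `inject_soft_errors`, `normalized_entropy`, `choose_exit`) and the threshold value are hypothetical and are not taken from the paper.

```python
import math
import random
import struct


def flip_bits_float32(value: float, bit_positions: list[int]) -> float:
    """Flip the given bit positions (0 = LSB, 31 = sign bit) of a float32 value."""
    # Reinterpret the float32 encoding as an unsigned 32-bit integer.
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    for pos in bit_positions:
        bits ^= 1 << pos
    # Reinterpret the modified integer as a float32 again.
    (flipped,) = struct.unpack("<f", struct.pack("<I", bits))
    return flipped


def inject_soft_errors(weights: list[float], ber: float,
                       rng: random.Random) -> list[float]:
    """Return a copy of weights in which every bit flips independently
    with probability ber (assumed fault model, not the paper's)."""
    corrupted = []
    for w in weights:
        positions = [b for b in range(32) if rng.random() < ber]
        corrupted.append(flip_bits_float32(w, positions) if positions else w)
    return corrupted


def normalized_entropy(probs: list[float]) -> float:
    """Shannon entropy of a probability vector, scaled to the range [0, 1]."""
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(probs))


def choose_exit(edge_probs: list[float], threshold: float) -> str:
    """Answer at the edge when its output is confident enough,
    otherwise escalate the sample to the fog/cloud nodes."""
    if normalized_entropy(edge_probs) <= threshold:
        return "edge"
    return "escalate"


if __name__ == "__main__":
    rng = random.Random(0)
    weights = [0.5, -0.25, 0.125, 1.0]
    print(inject_soft_errors(weights, ber=1e-2, rng=rng))
    # A single flip in the top exponent bit turns 1.0 into infinity.
    print(flip_bits_float32(1.0, [30]))
    print(choose_exit([0.97, 0.01, 0.01, 0.01], threshold=0.5))
    print(choose_exit([0.25, 0.25, 0.25, 0.25], threshold=0.5))
```

Under these assumptions, a corrupted edge model that produces flatter output distributions would cross the entropy threshold more often and escalate more samples, which is one plausible way to read the response-time increase the abstract reports.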

Source journal
Concurrency and Computation-Practice & Experience
Category: Engineering & Technology - Computer Science: Theory & Methods
CiteScore: 5.00
Self-citation rate: 10.00%
Articles published: 664
Review time: 9.6 months
Journal description: Concurrency and Computation: Practice and Experience (CCPE) publishes high-quality, original research papers and authoritative research review papers in the overlapping fields of: Parallel and distributed computing; High-performance computing; Computational and data science; Artificial intelligence and machine learning; Big data applications, algorithms, and systems; Network science; Ontologies and semantics; Security and privacy; Cloud/edge/fog computing; Green computing; and Quantum computing.