{"title":"分布式深度神经网络软错误弹性综合分析","authors":"Setareh Ahsaei, Mohsen Raji, Maryam Asadi Golmankhaneh","doi":"10.1002/cpe.70259","DOIUrl":null,"url":null,"abstract":"<div>\n \n <p>Distributed deep neural networks (DDNNs) have emerged as a promising solution to enhance the efficiency of deep learning tasks compared to traditional centralized cloud-based Deep Neural Networks (DNNs) by distributing the computational workload across cloud, fog, and edge nodes. Although model parameter changes caused by the well-known soft error effects have shown considerable degradation in the performance and reliability of DNNs, the resiliency of DDNNs against these effects is still understudied. This paper conducts a comprehensive analysis of the error resiliency of DDNNs, focusing on the impact of soft errors at various network layers. Using Docker containers to emulate real-world scenarios, the study evaluates SqueezeNet and MobileNetV2 models trained on CIFAR-100 and CIFAR-10 datasets under varying bit error rates (BER). The obtained results demonstrate that up to a certain BER, errors introduce uncertainty in the edge node of DDNNs while beyond this BER threshold, the edge node becomes significantly compromised due to faults, leading to a high likelihood of false decisions. Increasing uncertainty causes the decision-making process to shift to the fog and cloud nodes, leading to a considerable increase in response time. The insights from this study not only deepen our understanding of fault tolerance in DDNNs but also lay the groundwork for creating more resilient and efficient distributed learning architectures. By utilizing Docker-based emulation, our approach provides a flexible and reproducible experimental framework that can be adapted for further studies in this area. Additionally, the findings highlight the need for adaptive strategies that can intelligently manage errors and computational resources across cloud, fog, and edge layers. These results are particularly relevant for time-sensitive applications like autonomous vehicles, industrial IoT systems, and smart city infrastructures, where the reliability and speed of DDNNs are critical.</p>\n </div>","PeriodicalId":55214,"journal":{"name":"Concurrency and Computation-Practice & Experience","volume":"37 23-24","pages":""},"PeriodicalIF":1.5000,"publicationDate":"2025-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"A Comprehensive Soft Error Resiliency Analysis of Distributed Deep Neural Networks\",\"authors\":\"Setareh Ahsaei, Mohsen Raji, Maryam Asadi Golmankhaneh\",\"doi\":\"10.1002/cpe.70259\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div>\\n \\n <p>Distributed deep neural networks (DDNNs) have emerged as a promising solution to enhance the efficiency of deep learning tasks compared to traditional centralized cloud-based Deep Neural Networks (DNNs) by distributing the computational workload across cloud, fog, and edge nodes. Although model parameter changes caused by the well-known soft error effects have shown considerable degradation in the performance and reliability of DNNs, the resiliency of DDNNs against these effects is still understudied. This paper conducts a comprehensive analysis of the error resiliency of DDNNs, focusing on the impact of soft errors at various network layers. 
Using Docker containers to emulate real-world scenarios, the study evaluates SqueezeNet and MobileNetV2 models trained on CIFAR-100 and CIFAR-10 datasets under varying bit error rates (BER). The obtained results demonstrate that up to a certain BER, errors introduce uncertainty in the edge node of DDNNs while beyond this BER threshold, the edge node becomes significantly compromised due to faults, leading to a high likelihood of false decisions. Increasing uncertainty causes the decision-making process to shift to the fog and cloud nodes, leading to a considerable increase in response time. The insights from this study not only deepen our understanding of fault tolerance in DDNNs but also lay the groundwork for creating more resilient and efficient distributed learning architectures. By utilizing Docker-based emulation, our approach provides a flexible and reproducible experimental framework that can be adapted for further studies in this area. Additionally, the findings highlight the need for adaptive strategies that can intelligently manage errors and computational resources across cloud, fog, and edge layers. These results are particularly relevant for time-sensitive applications like autonomous vehicles, industrial IoT systems, and smart city infrastructures, where the reliability and speed of DDNNs are critical.</p>\\n </div>\",\"PeriodicalId\":55214,\"journal\":{\"name\":\"Concurrency and Computation-Practice & Experience\",\"volume\":\"37 23-24\",\"pages\":\"\"},\"PeriodicalIF\":1.5000,\"publicationDate\":\"2025-09-03\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Concurrency and Computation-Practice & Experience\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://onlinelibrary.wiley.com/doi/10.1002/cpe.70259\",\"RegionNum\":4,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"COMPUTER SCIENCE, SOFTWARE ENGINEERING\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Concurrency and Computation-Practice & Experience","FirstCategoryId":"94","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1002/cpe.70259","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, SOFTWARE ENGINEERING","Score":null,"Total":0}
A Comprehensive Soft Error Resiliency Analysis of Distributed Deep Neural Networks
Distributed deep neural networks (DDNNs) have emerged as a promising solution to enhance the efficiency of deep learning tasks compared to traditional centralized cloud-based deep neural networks (DNNs) by distributing the computational workload across cloud, fog, and edge nodes. Although model parameter changes caused by well-known soft error effects have been shown to considerably degrade the performance and reliability of DNNs, the resiliency of DDNNs against these effects is still understudied. This paper conducts a comprehensive analysis of the error resiliency of DDNNs, focusing on the impact of soft errors at various network layers. Using Docker containers to emulate real-world deployment scenarios, the study evaluates SqueezeNet and MobileNetV2 models trained on the CIFAR-100 and CIFAR-10 datasets under varying bit error rates (BERs). The results demonstrate that up to a certain BER, errors introduce uncertainty in the edge node of DDNNs, whereas beyond this threshold the edge node becomes significantly compromised by faults, leading to a high likelihood of false decisions. Increasing uncertainty causes the decision-making process to shift to the fog and cloud nodes, considerably increasing response time. The insights from this study not only deepen our understanding of fault tolerance in DDNNs but also lay the groundwork for more resilient and efficient distributed learning architectures. By utilizing Docker-based emulation, our approach provides a flexible and reproducible experimental framework that can be adapted for further studies in this area. Additionally, the findings highlight the need for adaptive strategies that can intelligently manage errors and computational resources across cloud, fog, and edge layers. These results are particularly relevant for time-sensitive applications such as autonomous vehicles, industrial IoT systems, and smart city infrastructures, where the reliability and speed of DDNNs are critical.
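The abstract does not reproduce the injection procedure itself, but the kind of evaluation it describes (flipping bits in model parameters at a chosen BER and letting the edge node defer to fog/cloud when its output becomes too uncertain) can be sketched compactly. The snippet below is a minimal, hypothetical illustration in PyTorch/NumPy, not the authors' implementation: the helper names (`flip_random_bits`, `inject_soft_errors`, `offload_to_fog`) and the entropy threshold of 1.0 are assumptions made for the example.

```python
"""Sketch only: parameter-level bit-flip injection at a given BER, plus an
entropy check an edge node could use to defer a decision to fog/cloud.
Hypothetical helpers; not the paper's code."""
import numpy as np
import torch


def flip_random_bits(t: torch.Tensor, ber: float, rng: np.random.Generator) -> torch.Tensor:
    """Return a copy of tensor `t` with each of its float32 bits flipped with probability `ber`."""
    flat = t.detach().cpu().float().numpy().ravel().copy()
    words = flat.view(np.uint32)                    # reinterpret float32 words as raw 32-bit patterns
    n_flips = rng.binomial(words.size * 32, ber)    # number of flipped bits out of N * 32
    word_idx = rng.integers(0, words.size, size=n_flips)
    bit_idx = rng.integers(0, 32, size=n_flips).astype(np.uint32)
    np.bitwise_xor.at(words, word_idx, np.uint32(1) << bit_idx)  # ufunc.at handles repeated indices
    return torch.from_numpy(flat.reshape(t.shape)).to(t.device)


@torch.no_grad()
def inject_soft_errors(model: torch.nn.Module, ber: float, seed: int = 0) -> None:
    """Corrupt every parameter tensor of `model` in place at the given bit error rate."""
    rng = np.random.default_rng(seed)
    for p in model.parameters():
        p.copy_(flip_random_bits(p, ber, rng))


def offload_to_fog(edge_logits: torch.Tensor, entropy_threshold: float = 1.0) -> bool:
    """Defer to the fog/cloud tier when the edge output distribution is too uncertain."""
    probs = torch.softmax(edge_logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1).mean()
    return bool(entropy > entropy_threshold)
```

In a setup like the one described, `inject_soft_errors` could be applied to the edge-resident copy of SqueezeNet or MobileNetV2 inside its Docker container, and the fraction of inputs for which `offload_to_fog` returns True, together with the end-to-end response time, could be recorded as the BER is swept.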
Journal Introduction:
Concurrency and Computation: Practice and Experience (CCPE) publishes high-quality original research papers and authoritative research review papers in the overlapping fields of:
Parallel and distributed computing;
High-performance computing;
Computational and data science;
Artificial intelligence and machine learning;
Big data applications, algorithms, and systems;
Network science;
Ontologies and semantics;
Security and privacy;
Cloud/edge/fog computing;
Green computing; and
Quantum computing.