A review of major ICT failures and recovery strategies: Strengthening digital resilience

IF 5.4; Zone 2 (Computer Science); JCR Q1 (COMPUTER SCIENCE, INFORMATION SYSTEMS)

Computers & Security. Pub Date: 2025-09-22. DOI: 10.1016/j.cose.2025.104678

Amr Adel, Noor H.S. Alani, Tony Jan, Mukesh Prasad

Computers & Security, Volume 159, Article 104678. Full text: https://www.sciencedirect.com/science/article/pii/S0167404825003670

Citations: 0

Abstract

This paper presents a comprehensive, cross-sector analysis of large-scale ICT failures to address the persistent gap in understanding how systemic digital breakdowns occur and propagate across platforms and industries. Through a comparative study of seven major global outages (2019–2024) — selected based on scale, technical transparency, and platform diversity — we identify recurring vulnerabilities in automation governance, configuration management, centralized infrastructure, and incident response. Using a custom analytical framework grounded in socio-technical and resilience engineering theory, the paper maps failure propagation patterns and derives a taxonomy of technical and organizational failure modes.

We empirically validate a suite of resilience strategies — including rollback automation, configuration-as-code, SOAR-enabled response orchestration, and chaos engineering — and demonstrate how they address failure propagation pathways observed in real-world incidents. A conceptual model for decentralized system upgrade planning is introduced, incorporating microservice segmentation, dependency mapping, and AI-assisted fault containment. The paper culminates in a forward-looking digital resilience roadmap that integrates predictive analytics, secure software supply chains, and adaptive human–machine collaboration. Core contributions include: (1) a cross-case classification of failure archetypes, (2) evidence-based design patterns for resilience, and (3) actionable frameworks for infrastructure operators and researchers working towards next-generation ICT robustness.

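Source journal

Computers & Security (Engineering & Technology: Computer Science, Information Systems)

CiteScore: 12.40

Self-citation rate: 7.10%

Articles published: 365

Review time: 10.7 months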

About the journal: Computers & Security is the most respected technical journal in the IT security field. With its high-profile editorial board and informative regular features and columns, the journal is essential reading for IT security professionals around the world. Computers & Security provides you with a unique blend of leading-edge research and sound practical management advice. It is aimed at the professional involved with computer security, audit, control and data integrity in all sectors - industry, commerce and academia. Recognized worldwide as THE primary source of reference for applied research and technical expertise, it is your first step to fully secure systems.