Alberto Mulone, Doriana Medić, Iacopo Colonnelli, Marco Aldinucci
A formal framework for fault tolerance in hybrid scientific workflows
Future Generation Computer Systems, volume 176, Article 108188. Published 2025-10-10.
DOI: 10.1016/j.future.2025.108188
https://www.sciencedirect.com/science/article/pii/S0167739X25004820
Abstract
In large-scale distributed systems, failures are routine events whose frequency grows with the number of computational tasks and execution locations. Representing an application as a workflow makes it possible to exploit Workflow Management System (WMS) features such as portability, scalability, and, crucially, reliability, which is essential for robust execution in dynamic and failure-prone environments. In recent years, hybrid workflows have posed new challenges by making it easier to distribute computations across heterogeneous and independent environments. Consequently, the number of possible points of failure during execution has increased, creating a need for fault tolerance mechanisms that address the specific requirements of hybrid systems. This work introduces a formal framework for a fault tolerance mechanism in hybrid workflows, enabling failure recovery through a rollback approach. The framework is rigorously defined by adapting and extending an existing workflow semantics tailored for hybrid execution. Our method leverages provenance data from the workflow execution up to the point of failure and creates a recovery workflow that spans multiple infrastructures. The rollback approach provides resilience against step failures and potential data loss. We then implement this mechanism in the StreamFlow WMS and evaluate it using two case studies: the 1000 Genomes workflow and a synthetic workflow featuring iterative patterns. Experiments demonstrate the conceptual validity of our approach and assess the overhead introduced by the mechanism, including data availability checks.
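To make the rollback idea concrete, the following is a minimal sketch, not the paper's formal semantics and not the StreamFlow API. It assumes a hypothetical provenance record in which each step lists the data tokens it consumed and produced, and a set of tokens that passed a data availability check. Starting from the failed step, it walks backwards through the producers of lost data to decide which steps a recovery workflow would need to re-execute. The names `Step`, `plan_recovery`, and the example pipeline are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass(frozen=True)
class Step:
    """Hypothetical provenance entry for one workflow step."""
    name: str
    inputs: Tuple[str, ...]   # data tokens the step consumed
    outputs: Tuple[str, ...]  # data tokens the step produced


def plan_recovery(steps: List[Step], failed: str, available: Set[str]) -> List[str]:
    """Return the names of the steps to re-run, in execution order.

    steps:     provenance entries, assumed to be in topological order
    failed:    name of the step that failed
    available: tokens that are still retrievable (availability check passed)
    """
    producer = {token: step for step in steps for token in step.outputs}
    by_name = {step.name: step for step in steps}

    to_rerun = set()
    stack = [by_name[failed]]
    while stack:
        step = stack.pop()
        if step.name in to_rerun:
            continue
        to_rerun.add(step.name)
        for token in step.inputs:
            # A lost input forces its producer to run again. Tokens without
            # a producer are workflow inputs, assumed here to be persistent.
            if token not in available and token in producer:
                stack.append(producer[token])

    return [step.name for step in steps if step.name in to_rerun]


# Illustrative four-step pipeline (assumed, not taken from the paper).
steps = [
    Step("fetch", ("url",), ("raw",)),
    Step("align", ("raw",), ("aligned",)),
    Step("call", ("aligned",), ("variants",)),
    Step("report", ("variants",), ("summary",)),
]

# "report" failed; only "url" and "raw" survived the availability check.
print(plan_recovery(steps, "report", {"url", "raw"}))
# ['align', 'call', 'report']
```

In this sketch the rollback stops as soon as it reaches data that is still available, so "fetch" is not repeated. If every intermediate token had survived, only the failed step itself would be re-run.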
About the journal:
Computing infrastructures and systems are constantly evolving, resulting in increasingly complex and collaborative scientific applications. To cope with these advancements, there is a growing need for collaborative tools that can effectively map, control, and execute these applications.
Furthermore, with the explosion of Big Data, there is a requirement for innovative methods and infrastructures to collect, analyze, and derive meaningful insights from the vast amount of data generated. This necessitates the integration of computational and storage capabilities, databases, sensors, and human collaboration.
Future Generation Computer Systems aims to pioneer advancements in distributed systems, collaborative environments, high-performance computing, and Big Data analytics. It strives to stay at the forefront of developments in grids, clouds, and the Internet of Things (IoT) to effectively address the challenges posed by these wide-area, fully distributed sensing and computing systems.