A formal framework for fault tolerance in hybrid scientific workflows

IF 6.2 · Zone 2, Computer Science · Q1 COMPUTER SCIENCE, THEORY & METHODS

Future Generation Computer Systems-The International Journal of Escience Pub Date : 2025-10-10 DOI:10.1016/j.future.2025.108188

Alberto Mulone, Doriana Medić, Iacopo Colonnelli, Marco Aldinucci

Volume 176, Article 108188 · https://www.sciencedirect.com/science/article/pii/S0167739X25004820


Abstract

In large-scale distributed systems, failures are routine events whose frequency grows with the number of computational tasks and execution locations. Representing an application as a workflow makes it possible to exploit Workflow Management System (WMS) features such as portability, scalability, and, crucially, reliability. Among these, reliability is essential for ensuring robust execution in dynamic and failure-prone environments. In recent years, the emergence of hybrid workflows has posed new and intriguing challenges by expanding the possibilities for distributing computations across heterogeneous and independent environments. Consequently, the number of possible points of failure during execution has increased, creating a need for sophisticated fault tolerance mechanisms capable of addressing the specific requirements of hybrid systems. This work introduces a formal framework for a fault tolerance mechanism in hybrid workflows, enabling failure recovery through a rollback approach. The framework is rigorously defined by adapting and extending an existing workflow semantics tailored for hybrid execution. Our method leverages provenance data from workflow execution up to the point of failure and creates a recovery workflow that spans multiple infrastructures. The rollback approach provides a robust and reliable strategy for resilience against step failures and potential data loss. We then implement this mechanism in the StreamFlow WMS and evaluate it using two case studies: the 1000 Genomes workflow and a synthetic workflow featuring iterative patterns. Experiments showcase the conceptual validity of our approach and assess the overhead introduced by the mechanism, including data availability checks.
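The rollback idea described in the abstract can be illustrated with a minimal sketch. This is not the paper's formal semantics and not the StreamFlow API. It assumes a simplified provenance record in which each step lists the data tokens it consumed and each token names the step that produced it. The names `plan_recovery`, `step_inputs`, `token_producer`, and `is_available` are hypothetical, and iterative (cyclic) patterns are deliberately left out.

```python
# Hypothetical, simplified provenance model (assumption, not StreamFlow's):
#   step_inputs[step]     -> tokens the step consumed
#   token_producer[token] -> step that produced the token
#                            (absent for initial workflow inputs)
#   is_available(token)   -> data availability check at recovery time

def plan_recovery(failed_step, step_inputs, token_producer, is_available):
    """Return the steps to re-execute, producers before consumers."""
    to_rerun = []      # filled in post-order, so dependencies come first
    visited = set()

    def visit(step):
        if step in visited:
            return
        visited.add(step)
        for token in step_inputs[step]:
            if is_available(token):
                continue  # data survived, so the rollback stops here
            producer = token_producer.get(token)
            if producer is None:
                # An initial input was lost; nothing can regenerate it.
                raise RuntimeError(f"lost workflow input: {token!r}")
            visit(producer)  # roll back one more step
        to_rerun.append(step)

    visit(failed_step)
    return to_rerun


# Toy linear pipeline: split -> align -> merge
step_inputs = {"split": ["raw"], "align": ["chunk"], "merge": ["aligned"]}
token_producer = {"chunk": "split", "aligned": "align"}

# Case 1: only the raw input survived, so every step must be re-run.
plan_all = plan_recovery("merge", step_inputs, token_producer,
                         lambda t: t in {"raw"})

# Case 2: the aligned data is still reachable, so only merge is re-run.
plan_one = plan_recovery("merge", step_inputs, token_producer,
                         lambda t: t in {"raw", "aligned"})
```

In this sketch the availability check is what bounds the rollback: the traversal walks backwards through the provenance only as far as the nearest data that still exists, which is also where a real mechanism would pay the cost of checking remote locations.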


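Source journal

Future Generation Computer Systems-The International Journal of Escience (Engineering & Technology, Computer Science: Theory & Methods)

CiteScore: 19.90

Self-citation rate: 2.70%

Articles published: 376

Review time: 10.6 months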

About the journal: Computing infrastructures and systems are constantly evolving, resulting in increasingly complex and collaborative scientific applications. To cope with these advancements, there is a growing need for collaborative tools that can effectively map, control, and execute these applications. Furthermore, with the explosion of Big Data, there is a requirement for innovative methods and infrastructures to collect, analyze, and derive meaningful insights from the vast amount of data generated. This necessitates the integration of computational and storage capabilities, databases, sensors, and human collaboration. Future Generation Computer Systems aims to pioneer advancements in distributed systems, collaborative environments, high-performance computing, and Big Data analytics. It strives to stay at the forefront of developments in grids, clouds, and the Internet of Things (IoT) to effectively address the challenges posed by these wide-area, fully distributed sensing and computing systems.