Dynamic transparent streaming in file-based workflows with CAPIO
Marco Edoardo Santimaria, Iacopo Colonnelli, Barbara Cantalupo, Massimo Torquati, Doriana Medić, Nicola Tuccari, Eva Sciacca, Marco Aldinucci
Future Generation Computer Systems-The International Journal of Escience, volume 176, Article 108159. Published 2025-09-27.
DOI: 10.1016/j.future.2025.108159
https://www.sciencedirect.com/science/article/pii/S0167739X25004534
Abstract
Advances in big data and the growth in complexity of modern applications highlight the necessity for optimizing workflow executions on different levels, such as hybrid workflow executions, automatic optimization of data movements, and efficient use of IO. Following this line, streaming features are the desired capabilities for file-based workflows as they can reduce overall execution times. Expanding workflows with streaming capabilities usually requires rewriting the application, which is time-consuming and requires deep knowledge of the application. With this work, we introduce the Cross-Application Programmable IO (CAPIO) methodology, of which the stack is composed of two parts: the CAPIO-CL coordination language and the CAPIO middleware (which implements the semantics expressed by the CAPIO-CL coordination language). The CAPIO-CL coordination language annotates synchronization semantics between files produced and consumed by workflow steps. At the same time, the CAPIO middleware improves the performance of file-based workflows, leveraging the information provided by the CAPIO-CL language while not having to change (recompile) the code of the original workflow steps. By design, the CAPIO middleware supports multiple backends and can be extended to support more. It is dynamic, and it supports dynamic job scheduling. Benchmarks, done on both microbenchmarks and real-life workflows, prove that with CAPIO, it is possible to reduce the workflow execution time by up to ∼50%.
About the journal:
Computing infrastructures and systems are constantly evolving, resulting in increasingly complex and collaborative scientific applications. To cope with these advancements, there is a growing need for collaborative tools that can effectively map, control, and execute these applications.
Furthermore, with the explosion of Big Data, there is a requirement for innovative methods and infrastructures to collect, analyze, and derive meaningful insights from the vast amount of data generated. This necessitates the integration of computational and storage capabilities, databases, sensors, and human collaboration.
Future Generation Computer Systems aims to pioneer advancements in distributed systems, collaborative environments, high-performance computing, and Big Data analytics. It strives to stay at the forefront of developments in grids, clouds, and the Internet of Things (IoT) to effectively address the challenges posed by these wide-area, fully distributed sensing and computing systems.