Adaptive incremental transfer learning for efficient performance modeling of big data workloads

IF 6.2 | Zone 2, Computer Science | JCR Q1 COMPUTER SCIENCE, THEORY & METHODS

Future Generation Computer Systems-The International Journal of Escience | Pub Date: 2025-01-26 | DOI: 10.1016/j.future.2025.107730

Mariano Garralda-Barrio, Carlos Eiras-Franco, Verónica Bolón-Canedo

Volume 166, Article 107730. Full text: https://www.sciencedirect.com/science/article/pii/S0167739X25000251

Citations: 0

Abstract

The rise of data-intensive scalable computing systems, such as Apache Spark, has transformed data processing by enabling the efficient manipulation of large datasets across machine clusters. However, system configuration to optimize performance remains a challenge. This paper introduces an adaptive incremental transfer learning approach to predicting workload execution times. By integrating both unsupervised and supervised learning, we develop models that adapt incrementally to new workloads and configurations. To guide the optimal selection of relevant workloads, the model employs the coefficient of distance variation (CdV) and the coefficient of quality correlation (CqC), combined in the exploration–exploitation balance coefficient (EEBC). Comprehensive evaluations demonstrate the robustness and reliability of our model for performance modeling in Spark applications, with average improvements of up to 31% over state-of-the-art methods. This research contributes to efficient performance tuning systems by enabling transfer learning from historical workloads to new, previously unseen workloads. The full source code is openly available.
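The abstract does not define CdV, CqC, or EEBC, so they are not reproduced here. As a rough illustration of the general idea only (reusing measurements from similar historical workloads to predict the runtime of a new one), the following minimal sketch substitutes plain Euclidean nearest-neighbor selection and an ordinary least-squares fit. Every name, feature, and number in it is hypothetical and is not taken from the paper or its source code.

```python
from math import dist
from statistics import mean

# Hypothetical history: each workload has a feature vector
# (input size in GB, shuffle ratio) and (executors, runtime in seconds) samples.
history = [
    {"name": "wordcount", "features": (10.0, 0.2),
     "samples": [(2, 110.0), (4, 60.0), (8, 35.0)]},
    {"name": "sort", "features": (12.0, 0.3),
     "samples": [(2, 130.0), (4, 70.0), (8, 40.0)]},
    {"name": "pagerank", "features": (50.0, 0.9),
     "samples": [(2, 900.0), (4, 500.0), (8, 300.0)]},
]


def select_similar(target_features, workloads, k=2):
    """Return the k workloads closest to the target by Euclidean distance.

    Assumption: a stand-in for the paper's workload selection, which the
    abstract says is guided by CdV, CqC, and EEBC.
    """
    ranked = sorted(workloads, key=lambda w: dist(w["features"], target_features))
    return ranked[:k]


def fit_runtime_model(samples):
    """Least-squares fit of runtime = a + b * (1 / executors)."""
    xs = [1.0 / executors for executors, _ in samples]
    ys = [runtime for _, runtime in samples]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    a = y_bar - b * x_bar
    return a, b


def predict_runtime(model, executors):
    """Predict runtime in seconds for a given executor count."""
    a, b = model
    return a + b / executors


# Transfer step: pool samples from the most similar historical workloads.
target_features = (10.5, 0.25)  # hypothetical new, unseen workload
selected = select_similar(target_features, history, k=2)
pooled = [s for w in selected for s in w["samples"]]
model = fit_runtime_model(pooled)
prediction = predict_runtime(model, executors=16)

# Incremental step: once the new workload has been observed, refit with it.
pooled.append((16, 25.0))  # hypothetical new measurement
updated_model = fit_runtime_model(pooled)

print([w["name"] for w in selected], round(prediction, 2))
```

The refit after each new measurement is the simplest possible form of incremental adaptation; a real system would presumably also revisit which historical workloads are selected as evidence accumulates.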


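Source journal

Future Generation Computer Systems-The International Journal of Escience (Engineering & Technology: Computer Science, Theory & Methods)

CiteScore: 19.90
Self-citation rate: 2.70%
Articles published: 376
Review time: 10.6 months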

Journal description: Computing infrastructures and systems are constantly evolving, resulting in increasingly complex and collaborative scientific applications. To cope with these advancements, there is a growing need for collaborative tools that can effectively map, control, and execute these applications. Furthermore, with the explosion of Big Data, there is a requirement for innovative methods and infrastructures to collect, analyze, and derive meaningful insights from the vast amount of data generated. This necessitates the integration of computational and storage capabilities, databases, sensors, and human collaboration. Future Generation Computer Systems aims to pioneer advancements in distributed systems, collaborative environments, high-performance computing, and Big Data analytics. It strives to stay at the forefront of developments in grids, clouds, and the Internet of Things (IoT) to effectively address the challenges posed by these wide-area, fully distributed sensing and computing systems.