Rhauani Weber Aita Fazul, Odorico Machado Mendizabal, Patrícia Pitthan Barcelos
Title: DARB: A Dynamic Architecture for Data Replica Balancing
Journal: Concurrency and Computation: Practice and Experience
Volume: 37 (9-11)
Publication date: 2025-04-08
Publication type: Journal Article
DOI: 10.1002/cpe.70050
URL: https://onlinelibrary.wiley.com/doi/10.1002/cpe.70050
Citations: 0
Abstract
Distributed file systems, such as HDFS, are designed to support applications that handle large volumes of data. Data replication, which is at the core of the HDFS storage model, is essential for fault tolerance and performance. As new data are loaded into the system, the distribution of replicated data blocks among the nodes may become uneven, degrading replica balancing and data locality. The HDFS Balancer is the official solution for redistributing the data already stored in the cluster. However, it overlooks the specific needs of the applications during data rearrangement and requires manual intervention by system administrators, a dependency that is often inadequate and inefficient. To address these limitations, this work presents DARB, a Dynamic Architecture for Replica Balancing that combines reactive and proactive strategies. The former uses the Prioritized Replica Balancing Policy to customize replica balancing through configurable priorities. The latter consists of an event-driven strategy that makes the overall balancing process in HDFS transparent. DARB comprises modular components and a metrics observation model that determines when corrective actions should be taken. It also automatically triggers the HDFS Balancer based on standardized trigger events. The evaluation results indicate that the proposed solution removes the need for manual configuration and execution while actively working to keep the cluster balanced, taking into account performance, reliability, and data availability perspectives. Thus, DARB offers a specialized balancing solution that makes the balancing process seamless and flexible, introducing to HDFS the concept of context-aware replica balancing.
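The abstract does not specify which metrics DARB observes or how its trigger events are defined, so the following is only a minimal sketch of the general idea of a threshold-based balancing trigger, not DARB's actual mechanism. It assumes the balance criterion used by the stock HDFS Balancer: a cluster counts as balanced when every DataNode's utilization differs from the overall cluster utilization by no more than a threshold percentage. The names `DataNode`, `unbalanced_nodes`, and `should_trigger_balancer` are hypothetical and introduced purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataNode:
    """Hypothetical snapshot of one node's storage metrics."""
    name: str
    used_bytes: int
    capacity_bytes: int

    @property
    def utilization(self) -> float:
        """Used space as a percentage of this node's capacity."""
        return 100.0 * self.used_bytes / self.capacity_bytes


def cluster_utilization(nodes):
    """Total used space as a percentage of total cluster capacity."""
    total_used = sum(n.used_bytes for n in nodes)
    total_capacity = sum(n.capacity_bytes for n in nodes)
    return 100.0 * total_used / total_capacity


def unbalanced_nodes(nodes, threshold_pct=10.0):
    """Names of nodes whose utilization deviates from the cluster
    utilization by more than threshold_pct percentage points."""
    avg = cluster_utilization(nodes)
    return [n.name for n in nodes if abs(n.utilization - avg) > threshold_pct]


def should_trigger_balancer(nodes, threshold_pct=10.0):
    """Assumed trigger rule: fire when at least one node is out of range."""
    return bool(unbalanced_nodes(nodes, threshold_pct))


if __name__ == "__main__":
    snapshot = [
        DataNode("dn1", 90, 100),
        DataNode("dn2", 55, 100),
        DataNode("dn3", 40, 100),
    ]
    print(unbalanced_nodes(snapshot))
    print(should_trigger_balancer(snapshot))
```

In a deployment along these lines, a positive result would be the point at which an automated component launches the balancer (for example via `hdfs balancer -threshold 10`) instead of waiting for an administrator to do so by hand.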
About the journal:
Concurrency and Computation: Practice and Experience (CCPE) publishes high-quality, original research papers, and authoritative research review papers, in the overlapping fields of:
Parallel and distributed computing;
High-performance computing;
Computational and data science;
Artificial intelligence and machine learning;
Big data applications, algorithms, and systems;
Network science;
Ontologies and semantics;
Security and privacy;
Cloud/edge/fog computing;
Green computing; and
Quantum computing.