HTD: heterogeneous throughput-driven task scheduling algorithm in MapReduce

IF 1.5 | Region 4 (Computer Science) | JCR Q3: COMPUTER SCIENCE, INFORMATION SYSTEMS

Distributed and Parallel Databases Pub Date : 2021-10-28 DOI:10.1007/s10619-021-07375-6

Xite Wang, Chaojin Wang, Mei Bai, Qian Ma, Guanyu Li


Citations: 2

Abstract

MapReduce is one of the most popular parallel data processing models and is widely used for data analysis in many fields. Task scheduling is the core module of a MapReduce system, and the quality of the scheduling algorithm directly affects the system's processing capacity. Because new nodes are continually added to a cluster to increase its processing capacity, the cluster inevitably becomes heterogeneous. Heterogeneous environments are common in practice, yet there has been little research on task scheduling in them. This paper therefore studies task scheduling in heterogeneous environments in depth and proposes a new task scheduling algorithm, HTD. First, we give a formal definition of the throughput-driven task scheduling problem in a heterogeneous environment. Second, we design the scheduling algorithm HTD, which quickly obtains the completion sequence of a job set and optimizes the details of task scheduling in a heterogeneous environment. Finally, a series of experiments demonstrates the efficiency and effectiveness of the algorithm.
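To make the setting concrete, the following is a minimal sketch of task scheduling on a heterogeneous cluster. It is not the HTD algorithm, whose details the abstract does not give; it is a generic greedy earliest-finish-time heuristic. The function name `schedule`, the notion of a per-node `speed`, and the sample numbers are all assumptions made for illustration.

```python
def schedule(task_sizes, node_speeds):
    """Greedy earliest-finish-time assignment on heterogeneous nodes.

    Illustrative only: this is NOT HTD. Each task has an abstract work
    size; each node has a hypothetical speed (work units per time unit).
    Requires at least one node.
    """
    free_at = [0.0] * len(node_speeds)  # time at which each node becomes idle
    assignment = []                     # (task_size, node_index) pairs

    # Place the largest tasks first, each on the node where it finishes earliest.
    for size in sorted(task_sizes, reverse=True):
        best = min(
            range(len(node_speeds)),
            key=lambda n: free_at[n] + size / node_speeds[n],
        )
        free_at[best] += size / node_speeds[best]
        assignment.append((size, best))

    makespan = max(free_at)  # time at which the last node finishes
    return assignment, makespan


if __name__ == "__main__":
    # One fast node (speed 2.0) and one slow node (speed 1.0).
    assignment, makespan = schedule([4, 4, 2, 2], [2.0, 1.0])
    print(assignment, makespan)
    if makespan > 0:
        print("tasks per time unit:", len(assignment) / makespan)
```

In this toy setup, the faster node absorbs the two large tasks while the slower node takes the two small ones, so both nodes finish at the same time. A throughput-driven scheduler in the sense of the abstract would optimize a formally defined objective over whole jobs rather than this simple per-task heuristic.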

Source journal

Distributed and Parallel Databases (Engineering & Technology - Computer Science: Theory & Methods)

CiteScore: 3.50

Self-citation rate: 0.00%

Review time: >12 weeks

Journal description: Distributed and Parallel Databases publishes papers in all the traditional as well as most emerging areas of database research, including: Availability and reliability; Benchmarking and performance evaluation, and tuning; Big Data Storage and Processing; Cloud Computing and Database-as-a-Service; Crowdsourcing; Data curation, annotation and provenance; Data integration, metadata Management, and interoperability; Data models, semantics, query languages; Data mining and knowledge discovery; Data privacy, security, trust; Data provenance, workflows, Scientific Data Management; Data visualization and interactive data exploration; Data warehousing, OLAP, Analytics; Graph data management, RDF, social networks; Information Extraction and Data Cleaning; Middleware and Workflow Management; Modern Hardware and In-Memory Database Systems; Query Processing and Optimization; Semantic Web and open data; Social Networks; Storage, indexing, and physical database design; Streams, sensor networks, and complex event processing; Strings, Texts, and Keyword Search; Spatial, temporal, and spatio-temporal databases; Transaction processing; Uncertain, probabilistic, and approximate databases.