{"title":"Analysis and Optimization of Data Import with Hadoop","authors":"Weijia Xu, Wei Luo, N. Woodward","doi":"10.1109/IPDPSW.2012.129","DOIUrl":null,"url":null,"abstract":"Data driven research has become an important part of scientific discovery in an increasing number of disciplines. In many cases, the sheer volume of data to be processed requires not only state-of-the-art computing resources but also carefully tuned and specifically developed software. These requirements are often associated with huge operational costs and significant expertise in software development. Due to its simplicity for the user and effectiveness at processing big data, Hadoop has become a popular software platform for large-scale data analysis. Using a Hadoop cluster in a remote shared infrastructure enables users to avoid the costs of maintaining a physical infrastructure. An inevitable step in using dynamically constructed Hadoop cluster is the initial importing of the data. This process is not trivial, particularly when the size of the data is large. In this paper, we evaluate the costs of importing large-scale data into a Hadoop cluster. We present a detailed analysis of the default data importing implementation in Hadoop and conduct a practical evaluation. Our evaluation includes tests with different hardware configurations, such as different network protocol and disk configurations. 
We also propose an implementation to improve the performance of importing data into a Hadoop cluster wherein the data is accessed directly by Data nodes during the import process.","PeriodicalId":378335,"journal":{"name":"2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum","volume":"23 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2012-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"20","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IPDPSW.2012.129","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 20
Abstract
Data-driven research has become an important part of scientific discovery in an increasing number of disciplines. In many cases, the sheer volume of data to be processed requires not only state-of-the-art computing resources but also carefully tuned, purpose-built software. These requirements are often associated with high operational costs and significant software-development expertise. Due to its simplicity for the user and its effectiveness at processing big data, Hadoop has become a popular software platform for large-scale data analysis. Using a Hadoop cluster on a remote shared infrastructure enables users to avoid the costs of maintaining a physical infrastructure. An unavoidable step in using a dynamically constructed Hadoop cluster is the initial import of the data. This process is not trivial, particularly when the data is large. In this paper, we evaluate the costs of importing large-scale data into a Hadoop cluster. We present a detailed analysis of the default data-import implementation in Hadoop and conduct a practical evaluation. Our evaluation includes tests with different hardware configurations, such as different network protocols and disk configurations. We also propose an implementation that improves the performance of importing data into a Hadoop cluster, wherein the data is accessed directly by DataNodes during the import process.