SMARTH: Enabling Multi-pipeline Data Transfer in HDFS

Hong Zhang, Liqiang Wang, Hai Huang
Published in: 2014 43rd International Conference on Parallel Processing
DOI: 10.1109/ICPP.2014.12
Publication date: 2014-10-18
Citations: 27

Abstract

Hadoop is a popular open-source implementation of the MapReduce programming model for handling large data sets, and HDFS is one of Hadoop's most commonly used distributed file systems. Surprisingly, we found that HDFS is inefficient when handling uploads of data files from a client's local file system, especially when the storage cluster is configured to use replicas. The root cause is HDFS's synchronous pipeline design. In this paper, we introduce an improved HDFS design called SMARTH. It uses asynchronous multi-pipeline data transfers instead of a single-pipeline stop-and-wait mechanism. SMARTH records the actual transfer speed of data blocks and sends this information to the namenode along with periodic heartbeat messages. The namenode sorts datanodes according to their past performance and tracks this information continuously. When a client initiates an upload request, the namenode sends it a list of "high performance" datanodes that it expects will yield the highest throughput for that client. By choosing higher-performance datanodes relative to each client and by taking advantage of the multi-pipeline design, our experiments show that SMARTH significantly improves the performance of data write operations compared to HDFS. Specifically, SMARTH improves data transfer throughput by 27-245% in a heterogeneous virtual cluster on Amazon EC2.
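The datanode-ranking mechanism the abstract describes can be sketched as follows. This is an illustrative reconstruction, not the paper's actual implementation: the class and method names (`NameNode`, `heartbeat`, `pick_pipeline`) are invented for exposition, and real HDFS heartbeats carry far more state than a single throughput number.

```python
# Hypothetical sketch of SMARTH's datanode ranking (names are invented).
# Each datanode piggybacks its measured block-transfer throughput on its
# periodic heartbeat; the namenode keeps these figures and, when a client
# opens a write pipeline, hands back the fastest datanodes first.

class NameNode:
    def __init__(self):
        self.throughput = {}  # datanode id -> last reported throughput (MB/s)

    def heartbeat(self, datanode_id, mb_per_s):
        # Record the transfer speed reported with the heartbeat message.
        self.throughput[datanode_id] = mb_per_s

    def pick_pipeline(self, replicas=3):
        # Sort datanodes by past performance, best first, and return the
        # top `replicas` nodes for the client's write pipeline.
        ranked = sorted(self.throughput, key=self.throughput.get, reverse=True)
        return ranked[:replicas]

nn = NameNode()
nn.heartbeat("dn1", 40.0)
nn.heartbeat("dn2", 95.5)
nn.heartbeat("dn3", 60.2)
print(nn.pick_pipeline())  # -> ['dn2', 'dn3', 'dn1']
```

Because each replica pipeline is dispatched asynchronously rather than with stop-and-wait acknowledgements, the client can begin streaming the next block while earlier blocks are still propagating to the slower replicas.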