On the Power of In-Network Caching in the Hadoop Distributed File System

Proceedings of the 6th ACM Conference on Information-Centric Networking. Pub Date: 2019-09-24. DOI: 10.1145/3357150.3357392

Eric Newberry, Beichuan Zhang


Citations: 7

Abstract

The Hadoop Distributed File System (HDFS) is a network file system used to support multiple widely used big data frameworks that can scale to run on large clusters. In this paper, we evaluate the effectiveness of using in-network caching on switches in HDFS-supported clusters to reduce per-link bandwidth usage in the network. We discovered that some applications featured large amounts of data requested by multiple clients and that, by caching read data in the network, the average per-link bandwidth usage of read operations in these applications could be reduced by more than half. We also found that the choice of cache replacement policy could have a significant impact on caching effectiveness in this environment, with LIRS and ARC generally performing best for larger and smaller cache sizes, respectively. Moreover, given the structure of HDFS write operations, we developed a mechanism to reduce the total per-link bandwidth usage of HDFS write operations by replacing write pipelining with multicast. To evaluate in-network caching potential, we developed a simulator that replays real traces through a fat-tree network modeling the caching design of the Named Data Networking (NDN) information-centric networking (ICN) architecture. Our results suggest that ICN-style in-network caching can provide significant benefits to HDFS-supported big data clusters, justifying future work to apply ICN architectures to cluster environments.
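The trace-replay idea in the abstract can be illustrated with a minimal sketch. This is not the authors' simulator: it models a single cache rather than a fat-tree network, it uses plain LRU as a simpler stand-in for the LIRS and ARC policies the paper evaluates, and the trace, block names, and the `LRUCache` and `replay` helpers are all hypothetical. It only shows how repeated reads of the same block by different clients turn into cache hits, which is the effect that would reduce upstream link usage.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-capacity block cache with least-recently-used eviction (illustrative only)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()  # block_id -> None, ordered oldest to newest

    def access(self, block_id):
        """Return True on a hit; on a miss, insert the block and evict if over capacity."""
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)  # mark as most recently used
            return True
        self.blocks[block_id] = None
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # drop the least recently used block
        return False


def replay(trace, capacity):
    """Replay (client, block_id) read requests through one cache; return the hit ratio."""
    cache = LRUCache(capacity)
    hits = sum(cache.access(block_id) for _client, block_id in trace)
    return hits / len(trace) if trace else 0.0


# Hypothetical trace: three clients reading overlapping blocks.
trace = [
    ("c1", "blk_A"), ("c2", "blk_A"), ("c3", "blk_B"),
    ("c1", "blk_C"), ("c2", "blk_B"), ("c3", "blk_A"),
]

for size in (1, 2, 3):
    print(f"cache size {size}: hit ratio {replay(trace, size):.2f}")
```

Every hit in this sketch stands for a read that a caching switch could serve without forwarding the request further toward the storage node. Swapping in another replacement policy would only require a class with the same `access` method.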

