Hadoop on Named Data Networking: Experience and Results

Proceedings of the 2017 ACM SIGMETRICS / International Conference on Measurement and Modeling of Computer Systems. Pub Date: 2017-06-05. DOI: 10.1145/3078505.3078508

Mathias Gibbens, C. Gniady, Lei Ye, Beichuan Zhang


Citations: 9

Abstract

In today's data centers, clusters of servers are arranged to perform various tasks in a massively distributed manner: handling web requests, processing scientific data, and running simulations of real-world problems. These clusters are complex and require significant planning and administration to ensure that they perform to their full potential. Planning and configuration can be a long and complicated process; once completed, it is hard to re-architect an existing cluster. In addition to planning the physical hardware, the software must also be properly configured to run on a cluster. Information such as which server is in which rack and the total network bandwidth between rows of racks constrains the placement of jobs scheduled to run on a cluster. Some software may be able to use hints provided by a user about where to schedule jobs, while other software may simply place them randomly and hope for the best.

Every cluster has at least one bottleneck that constrains overall performance to below the optimum achievable on paper. One common bottleneck is the speed of the network: communication between servers in a rack may be unable to saturate their network connections, but traffic flowing between racks or rows in a data center can easily overwhelm the interconnect switches. Various network topologies have been proposed to help mitigate this problem by providing multiple paths between points in the network, but they all suffer from the same fundamental problem: it is cost-prohibitive to build a network that can provide concurrent full network bandwidth between all servers. Researchers have been developing new network protocols that make more efficient use of existing network hardware by blurring the line between the network layer and applications. One of the best-known examples is Named Data Networking (NDN), a data-centric network architecture that has been in development for several years.

While NDN has received significant attention for the wide-area Internet, a detailed understanding of NDN's benefits and challenges in the data center environment has been lacking. The Named Data Networking architecture retrieves content by name rather than by connecting to specific hosts. It provides benefits such as highly efficient and resilient content distribution, which are well suited to data-intensive distributed computing. This paper presents and discusses our experience in modifying Apache Hadoop, a popular MapReduce framework, to operate on an NDN network. Through this first-of-its-kind implementation process, we demonstrate the feasibility of running an existing, large, and complex piece of distributed software commonly seen in data centers over NDN. We show advantages such as simplified network code and reduced network traffic, which are beneficial in a data center environment. NDN also faces challenges, which the community is addressing, that can be magnified under data center traffic. Through detailed evaluation, we show a 16% reduction in overall data transmission between Hadoop nodes while writing data with default replication settings. Preliminary results also show promise for in-network caching of repeated reads in distributed applications. We show that while overall performance is currently slower under NDN, there are challenges and opportunities for further NDN improvements.
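
As an illustration of retrieving content by name and of in-network caching of repeated reads, the following is a minimal toy sketch in Python. It is not the authors' implementation and does not use any real NDN or Hadoop API: the classes `Producer` and `CachingForwarder`, the method `express_interest`, and the name `/hadoop/block/0001` are all hypothetical, chosen only for this example. It models a forwarder that keeps a content store keyed by data name, so that repeated requests for the same name are answered from the cache instead of reaching the data source again.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Producer:
    """Hypothetical data source that answers requests by name."""

    data: dict[str, bytes]
    requests_served: int = 0

    def fetch(self, name: str) -> bytes:
        # Count every request that actually reaches the source.
        self.requests_served += 1
        return self.data[name]


@dataclass
class CachingForwarder:
    """Toy forwarder with a content store keyed by data name (assumption)."""

    upstream: Producer
    content_store: dict[str, bytes] = field(default_factory=dict)

    def express_interest(self, name: str) -> bytes:
        # A request names the data it wants, not the host that holds it.
        if name in self.content_store:
            return self.content_store[name]  # served from the cache
        payload = self.upstream.fetch(name)  # forwarded toward the source
        self.content_store[name] = payload   # cached for later requests
        return payload


def demo() -> int:
    """Read the same named block three times; return the source's request count."""
    producer = Producer({"/hadoop/block/0001": b"block contents"})
    forwarder = CachingForwarder(producer)
    for _ in range(3):
        forwarder.express_interest("/hadoop/block/0001")
    return producer.requests_served


if __name__ == "__main__":
    print(demo())
```

In this sketch only the first of the three reads reaches the producer; the other two are answered by the forwarder's content store. Real NDN forwarders add naming conventions, cache eviction, and data signatures, all of which are omitted here.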
