MeLoN: Distributed Deep Learning meets the Big Data Platform

2021 IEEE International Conference on Autonomic Computing and Self-Organizing Systems Companion (ACSOS-C). Pub Date: 2021-09-01. DOI: 10.1109/ACSOS-C52956.2021.00028

Dae-Cheol Kang, Seoungbeom Heo, Hyeounji Jang, Hyeock-Jin Lee, Minkyoung Cho, Jik-Soo Kim

Citations: 1

Abstract

MeLoN: Distributed Deep Learning meets the Big Data Platform

Recent advances in Artificial Intelligence have made “Deep Learning” frameworks a cornerstone of the 4th Industrial Revolution, alongside “Big Data” platform technologies such as Apache Hadoop. However, processing deep learning applications efficiently has become challenging as data and model sizes grow rapidly. To address this problem, we can leverage big data platforms, which have provided stable storage and data processing capability over the past decade. In this paper, we present the design and implementation of MeLoN (Multi-tenant dEep Learning framework On yarN), which can effectively run distributed deep learning applications on top of the big data platform Hadoop. MeLoN takes the expected GPU memory usage of a deep learning application as an input parameter and employs a GPU over-provisioning policy that can improve overall resource utilization. Evaluation results show that MeLoN can improve overall system throughput when multiple deep learning applications run concurrently in a Hadoop cluster. MeLoN also raises many interesting research issues, including profiling the expected GPU memory usage of deep learning applications, storage optimizations for deep learning processing, and support for complex deep learning jobs based on queuing systems, which can ultimately contribute to a new data processing framework in the YARN-based Hadoop ecosystem.
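
The abstract does not spell out how the GPU over-provisioning policy works. As a rough illustration only, the sketch below shows one plausible reading: several applications share a GPU as long as the sum of their declared expected memory usage fits within the device's memory. The `Gpu` class, the `place` function, the first-fit strategy, and all memory figures are hypothetical and are not taken from MeLoN itself.

```python
from dataclasses import dataclass, field


@dataclass
class Gpu:
    """One physical GPU with a fixed memory capacity in MiB (hypothetical model)."""
    gpu_id: str
    capacity_mib: int
    jobs: dict = field(default_factory=dict)  # job name -> expected memory (MiB)

    @property
    def free_mib(self) -> int:
        # Memory not yet claimed by the expected usage of co-located jobs.
        return self.capacity_mib - sum(self.jobs.values())


def place(gpus, job, expected_mib):
    """First-fit placement: share a GPU while the declared memory still fits.

    Returns the chosen GPU id, or None if no GPU has enough free memory.
    This is an assumed policy for illustration, not MeLoN's actual algorithm.
    """
    for gpu in gpus:
        if expected_mib <= gpu.free_mib:
            gpu.jobs[job] = expected_mib
            return gpu.gpu_id
    return None


# Two GPUs with 16000 MiB each; four jobs declare their expected memory usage.
gpus = [Gpu("gpu0", 16000), Gpu("gpu1", 16000)]
requests = [("resnet", 9000), ("lstm", 6000), ("bert", 12000), ("mnist", 3000)]
placement = {name: place(gpus, name, mem) for name, mem in requests}
print(placement)
```

With exclusive GPU allocation, only two of these four jobs could run at once on two devices. Under the assumed sharing policy all four are admitted, which is the kind of utilization gain the abstract attributes to over-provisioning. A real scheduler would also have to handle jobs whose actual usage exceeds the declared value.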

Source journal

2021 IEEE International Conference on Autonomic Computing and Self-Organizing Systems Companion (ACSOS-C)

Self-citation rate

0.00%

Number of publications