Infrastructure-Aware TensorFlow for Heterogeneous Datacenters

Moiz Arif, M. M. Rafique, Seung-Hwan Lim, Zaki Malik
Published in: 2020 28th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS), 2020-11-17
DOI: 10.1109/MASCOTS50786.2020.9285969 (https://doi.org/10.1109/MASCOTS50786.2020.9285969)
Citations: 4

Abstract

Heterogeneous datacenters, which combine a variety of compute, memory, and network resources, are increasingly used to meet the resource requirements of time-sensitive applications. One such application framework is TensorFlow, which has become a platform of choice for running machine learning workloads. The state-of-the-art TensorFlow platform is oblivious to the availability and performance profiles of the underlying datacenter resources and does not take the resource requirements of a given workload into account for distributed training. As a result, training tasks are executed on busy and resource-constrained worker nodes, which significantly increases the overall training time. In this paper, we address this challenge and propose architectural improvements and new software modules for the default TensorFlow platform that make it aware of the availability and capabilities of the underlying datacenter resources. The proposed Infrastructure-Aware TensorFlow schedules training tasks on the best possible resources for execution and reduces the overall training time. Our evaluation on worker nodes with varying availability and performance profiles shows that the proposed enhancements reduce training time by up to 54% compared to the default TensorFlow platform.
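
The abstract does not describe the scheduling policy in detail, so the following is only a minimal sketch of what infrastructure-aware worker selection could look like, not the authors' implementation. The `WorkerProfile` fields, the `idle_capacity` score, and the `select_workers` helper are all hypothetical names introduced here for illustration: workers that cannot satisfy the workload's memory requirement are filtered out, and the remaining ones are ranked by how much compute they have free.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerProfile:
    """Hypothetical snapshot of one worker node (field names are assumptions)."""
    name: str
    cpu_cores: int          # capability: number of cores
    free_memory_gb: float   # availability: memory not currently in use
    cpu_utilization: float  # availability: 0.0 (idle) to 1.0 (fully busy)


def idle_capacity(worker: WorkerProfile) -> float:
    """Illustrative score: the number of cores that are effectively free."""
    return worker.cpu_cores * (1.0 - worker.cpu_utilization)


def select_workers(workers, num_needed, min_free_memory_gb):
    """Pick up to `num_needed` workers that meet the memory requirement,
    preferring those with the most idle compute capacity."""
    eligible = [w for w in workers if w.free_memory_gb >= min_free_memory_gb]
    ranked = sorted(eligible, key=idle_capacity, reverse=True)
    return [w.name for w in ranked[:num_needed]]


# Made-up cluster state, for illustration only.
cluster = [
    WorkerProfile("worker-0", cpu_cores=16, free_memory_gb=8.0, cpu_utilization=0.90),
    WorkerProfile("worker-1", cpu_cores=8, free_memory_gb=24.0, cpu_utilization=0.10),
    WorkerProfile("worker-2", cpu_cores=32, free_memory_gb=64.0, cpu_utilization=0.50),
    WorkerProfile("worker-3", cpu_cores=4, free_memory_gb=2.0, cpu_utilization=0.00),
]

chosen = select_workers(cluster, num_needed=2, min_free_memory_gb=4.0)
print(chosen)
```

In this made-up cluster, the busy 16-core node ranks below the lightly loaded 8-core node, and the idle 4-core node is excluded because it lacks memory. A placement that ignores availability and requirements could just as easily assign work to either of those two constrained nodes, which is the situation the abstract identifies as the cause of longer training times.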