Infrastructure-Aware TensorFlow for Heterogeneous Datacenters
Moiz Arif, M. M. Rafique, Seung-Hwan Lim, Zaki Malik
2020 28th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS)
DOI: 10.1109/MASCOTS50786.2020.9285969
Published: 2020-11-17
Citations: 4
Abstract
Heterogeneous datacenters, with a variety of compute, memory, and network resources, are becoming increasingly popular for addressing the resource requirements of time-sensitive applications. One such application framework is TensorFlow, which has become a platform of choice for running machine learning workloads. The state-of-the-art TensorFlow platform is oblivious to the availability and performance profiles of the underlying datacenter resources and does not incorporate the resource requirements of the given workloads for distributed training. This leads to training tasks being executed on busy and resource-constrained worker nodes, which significantly increases the overall training time. In this paper, we address this challenge and propose architectural improvements and new software modules in the default TensorFlow platform to make it aware of the availability and capabilities of the underlying datacenter resources. The proposed Infrastructure-Aware TensorFlow schedules training tasks on the best available resources and thereby reduces the overall training time. Our evaluation using worker nodes with varying availability and performance profiles shows that the proposed enhancements reduce training time by up to 54% compared to the default TensorFlow platform.
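To make the core idea concrete: the sketch below illustrates, in plain TensorFlow, how a scheduler could consult per-worker load metrics and pin training work to the least busy task instead of accepting default placement. This is an illustrative sketch only, not the paper's actual modules (which are not shown here); the hostnames and the worker_load table are hypothetical placeholders for what a monitoring agent might report.

```python
# Hypothetical sketch of infrastructure-aware task placement.
# The load table and hostnames are illustrative assumptions, not the
# paper's implementation.
import tensorflow as tf

# Hypothetical per-worker load, e.g. from a monitoring agent (0.0 = idle).
worker_load = {0: 0.92, 1: 0.35, 2: 0.60}

# Standard TensorFlow cluster definition for three worker tasks.
cluster = tf.train.ClusterSpec({
    "worker": [
        "w0.example.com:2222",
        "w1.example.com:2222",
        "w2.example.com:2222",
    ],
})

# Infrastructure-aware choice: pick the task with the most headroom,
# rather than letting the runtime place tasks obliviously.
best_task = min(worker_load, key=worker_load.get)
device = f"/job:worker/task:{best_task}"
print("placing training ops on", device)  # -> /job:worker/task:1

# In a live cluster, one would connect to it and pin ops to that device:
#   tf.config.experimental_connect_to_cluster(resolver)
#   with tf.device(device):
#       train_step(...)
```

The default platform would assign work to whichever task the configuration names first, regardless of load; the selection step above is the minimal change that captures the availability-aware behavior the abstract describes.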