Benchmarking Resource Usage for Efficient Distributed Deep Learning

Nathan C Frey, Baolin Li, Joseph McDonald, Dan Zhao, Michael Jones, David Bestor, Devesh Tiwari, V. Gadepally, S. Samsi
{"title":"高效分布式深度学习的资源使用基准测试","authors":"Nathan C Frey, Baolin Li, Joseph McDonald, Dan Zhao, Michael Jones, David Bestor, Devesh Tiwari, V. Gadepally, S. Samsi","doi":"10.1109/HPEC55821.2022.9926375","DOIUrl":null,"url":null,"abstract":"Deep learning (DL) workflows demand an ever-increasing budget of compute and energy in order to achieve outsized gains. As such, it becomes essential to understand how different deep neural networks (DNNs) and training leverage increasing compute and energy resources-especially specialized computationally-intensive models across different domains and applications. In this paper, we conduct over 3,400 experiments training an array of deep networks representing various domains/tasks-natural language processing, computer vision, and chemistry-on up to 424 graphics processing units (GPUs). During training, our experiments systematically vary compute resource characteristics and energy -saving mechanisms such as power utilization and GPU clock rate limits to capture and illustrate the different trade-offs and scaling behaviors each representative model exhibits under various resource and energy-constrained regimes. We fit power law models that describe how training time scales with available compute resources and energy constraints. We anticipate that these findings will help inform and guide high-performance computing providers in optimizing resource utilization, by selectively reducing energy consumption for different deep learning tasks/workflows with minimal impact on training.","PeriodicalId":200071,"journal":{"name":"2022 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":"{\"title\":\"Benchmarking Resource Usage for Efficient Distributed Deep Learning\",\"authors\":\"Nathan C Frey, Baolin Li, Joseph McDonald, Dan Zhao, Michael Jones, David Bestor, Devesh Tiwari, V. Gadepally, S. Samsi\",\"doi\":\"10.1109/HPEC55821.2022.9926375\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Deep learning (DL) workflows demand an ever-increasing budget of compute and energy in order to achieve outsized gains. As such, it becomes essential to understand how different deep neural networks (DNNs) and training leverage increasing compute and energy resources-especially specialized computationally-intensive models across different domains and applications. In this paper, we conduct over 3,400 experiments training an array of deep networks representing various domains/tasks-natural language processing, computer vision, and chemistry-on up to 424 graphics processing units (GPUs). During training, our experiments systematically vary compute resource characteristics and energy -saving mechanisms such as power utilization and GPU clock rate limits to capture and illustrate the different trade-offs and scaling behaviors each representative model exhibits under various resource and energy-constrained regimes. We fit power law models that describe how training time scales with available compute resources and energy constraints. 
We anticipate that these findings will help inform and guide high-performance computing providers in optimizing resource utilization, by selectively reducing energy consumption for different deep learning tasks/workflows with minimal impact on training.\",\"PeriodicalId\":200071,\"journal\":{\"name\":\"2022 IEEE High Performance Extreme Computing Conference (HPEC)\",\"volume\":\"26 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-01-28\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"6\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2022 IEEE High Performance Extreme Computing Conference (HPEC)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/HPEC55821.2022.9926375\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 IEEE High Performance Extreme Computing Conference (HPEC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/HPEC55821.2022.9926375","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 6

Abstract

Deep learning (DL) workflows demand an ever-increasing budget of compute and energy in order to achieve outsized gains. As such, it becomes essential to understand how different deep neural networks (DNNs) and training leverage increasing compute and energy resources, especially specialized, computationally intensive models across different domains and applications. In this paper, we conduct over 3,400 experiments training an array of deep networks representing various domains/tasks (natural language processing, computer vision, and chemistry) on up to 424 graphics processing units (GPUs). During training, our experiments systematically vary compute resource characteristics and energy-saving mechanisms such as power utilization and GPU clock rate limits to capture and illustrate the different trade-offs and scaling behaviors each representative model exhibits under various resource- and energy-constrained regimes. We fit power-law models that describe how training time scales with available compute resources and energy constraints. We anticipate that these findings will help inform and guide high-performance computing providers in optimizing resource utilization, by selectively reducing energy consumption for different deep learning tasks/workflows with minimal impact on training.
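The abstract's power-law fits relate training time to the available compute. The sketch below illustrates what fitting such a model could look like; it is not the authors' code, and the functional form t(n) = a * n^b as well as the sample measurements are assumptions made purely for illustration.

```python
# Illustrative sketch: fit a power law t(n) = a * n**b relating training
# time t to the number of GPUs n. The data below are hypothetical, not
# measurements from the paper.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b):
    """Training time modeled as a power law in the number of GPUs."""
    return a * np.power(n, b)

# Hypothetical (GPU count, training time in hours) measurements.
gpus = np.array([2, 4, 8, 16, 32, 64, 128], dtype=float)
hours = np.array([40.1, 21.0, 11.3, 6.4, 3.9, 2.6, 1.9])

# Fit the scaling exponent; p0 is an initial guess for (a, b).
(a, b), _ = curve_fit(power_law, gpus, hours, p0=(80.0, -1.0))
print(f"fitted scaling: t(n) ~ {a:.1f} * n^{b:.2f}")
```

An exponent b close to -1 would indicate near-linear speedup with added GPUs, while values closer to 0 would indicate diminishing returns, which is the kind of trade-off the paper quantifies under power and clock-rate constraints.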