{"title":"BlueConnect:异构网络层次上深度学习的分解全约法","authors":"M. Cho;U. Finkler;M. Serrano;D. Kung;H. Hunter","doi":"10.1147/JRD.2019.2947013","DOIUrl":null,"url":null,"abstract":"As deep neural networks get more complex and input datasets get larger, it can take days or even weeks to train a deep neural network to the desired accuracy. Therefore, enabling distributed deep learning at a massive scale is critical since it offers the potential to reduce the training time from weeks to hours. In this article, we present BlueConnect, an efficient communication library for distributed deep learning that is highly optimized for popular GPU-based platforms. BlueConnect decomposes a single all-reduce operation into a large number of parallelizable reduce–scatter and all-gather operations to exploit the tradeoff between latency and bandwidth and adapt to a variety of network configurations. Therefore, each individual operation can be mapped to a different network fabric and take advantage of the best performing implementation for the corresponding fabric. According to our experimental results on two system configurations, BlueConnect can outperform the leading industrial communication library by a wide margin, and the BlueConnect-integrated Caffe2 can significantly reduce synchronization overhead by 87% on 192 GPUs for Resnet-50 training over prior schemes.","PeriodicalId":55034,"journal":{"name":"IBM Journal of Research and Development","volume":null,"pages":null},"PeriodicalIF":1.3000,"publicationDate":"2019-10-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1147/JRD.2019.2947013","citationCount":"71","resultStr":"{\"title\":\"BlueConnect: Decomposing all-reduce for deep learning on heterogeneous network hierarchy\",\"authors\":\"M. Cho;U. Finkler;M. Serrano;D. Kung;H. Hunter\",\"doi\":\"10.1147/JRD.2019.2947013\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"As deep neural networks get more complex and input datasets get larger, it can take days or even weeks to train a deep neural network to the desired accuracy. Therefore, enabling distributed deep learning at a massive scale is critical since it offers the potential to reduce the training time from weeks to hours. In this article, we present BlueConnect, an efficient communication library for distributed deep learning that is highly optimized for popular GPU-based platforms. BlueConnect decomposes a single all-reduce operation into a large number of parallelizable reduce–scatter and all-gather operations to exploit the tradeoff between latency and bandwidth and adapt to a variety of network configurations. Therefore, each individual operation can be mapped to a different network fabric and take advantage of the best performing implementation for the corresponding fabric. According to our experimental results on two system configurations, BlueConnect can outperform the leading industrial communication library by a wide margin, and the BlueConnect-integrated Caffe2 can significantly reduce synchronization overhead by 87% on 192 GPUs for Resnet-50 training over prior schemes.\",\"PeriodicalId\":55034,\"journal\":{\"name\":\"IBM Journal of Research and Development\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":1.3000,\"publicationDate\":\"2019-10-14\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://sci-hub-pdf.com/10.1147/JRD.2019.2947013\",\"citationCount\":\"71\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IBM Journal of Research and Development\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/8868152/\",\"RegionNum\":4,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"Computer Science\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IBM Journal of Research and Development","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/8868152/","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"Computer Science","Score":null,"Total":0}
BlueConnect: Decomposing all-reduce for deep learning on heterogeneous network hierarchy
As deep neural networks get more complex and input datasets get larger, it can take days or even weeks to train a deep neural network to the desired accuracy. Therefore, enabling distributed deep learning at a massive scale is critical since it offers the potential to reduce the training time from weeks to hours. In this article, we present BlueConnect, an efficient communication library for distributed deep learning that is highly optimized for popular GPU-based platforms. BlueConnect decomposes a single all-reduce operation into a large number of parallelizable reduce–scatter and all-gather operations to exploit the tradeoff between latency and bandwidth and adapt to a variety of network configurations. Therefore, each individual operation can be mapped to a different network fabric and take advantage of the best performing implementation for the corresponding fabric. According to our experimental results on two system configurations, BlueConnect can outperform the leading industrial communication library by a wide margin, and the BlueConnect-integrated Caffe2 can significantly reduce synchronization overhead by 87% on 192 GPUs for Resnet-50 training over prior schemes.
期刊介绍:
The IBM Journal of Research and Development is a peer-reviewed technical journal, published bimonthly, which features the work of authors in the science, technology and engineering of information systems. Papers are written for the worldwide scientific research and development community and knowledgeable professionals.
Submitted papers are welcome from the IBM technical community and from non-IBM authors on topics relevant to the scientific and technical content of the Journal.