SiP-ML:用于机器学习训练的高带宽光网络互连

Mehrdad Khani Shirkoohi, M. Ghobadi, M. Alizadeh, Ziyi Zhu, M. Glick, K. Bergman, A. Vahdat, Benjamin Klenk, Eiman Ebrahimi
{"title":"SiP-ML:用于机器学习训练的高带宽光网络互连","authors":"Mehrdad Khani Shirkoohi, M. Ghobadi, M. Alizadeh, Ziyi Zhu, M. Glick, K. Bergman, A. Vahdat, Benjamin Klenk, Eiman Ebrahimi","doi":"10.1145/3452296.3472900","DOIUrl":null,"url":null,"abstract":"This paper proposes optical network interconnects as a key enabler for building high-bandwidth ML training clusters with strong scaling properties. Our design, called SiP-ML, accelerates the training time of popular DNN models using silicon photonics links capable of providing multiple terabits-per-second of bandwidth per GPU. SiP-ML partitions the training job across GPUs with hybrid data and model parallelism while ensuring the communication pattern can be supported efficiently on the network interconnect. We develop task partitioning and device placement methods that take the degree and reconfiguration latency of optical interconnects into account. Simulations using real DNN models show that, compared to the state-of-the-art electrical networks, our approach improves training time by 1.3--9.1x.","PeriodicalId":20487,"journal":{"name":"Proceedings of the 2021 ACM SIGCOMM 2021 Conference","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2021-08-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"48","resultStr":"{\"title\":\"SiP-ML: high-bandwidth optical network interconnects for machine learning training\",\"authors\":\"Mehrdad Khani Shirkoohi, M. Ghobadi, M. Alizadeh, Ziyi Zhu, M. Glick, K. Bergman, A. Vahdat, Benjamin Klenk, Eiman Ebrahimi\",\"doi\":\"10.1145/3452296.3472900\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"This paper proposes optical network interconnects as a key enabler for building high-bandwidth ML training clusters with strong scaling properties. Our design, called SiP-ML, accelerates the training time of popular DNN models using silicon photonics links capable of providing multiple terabits-per-second of bandwidth per GPU. SiP-ML partitions the training job across GPUs with hybrid data and model parallelism while ensuring the communication pattern can be supported efficiently on the network interconnect. We develop task partitioning and device placement methods that take the degree and reconfiguration latency of optical interconnects into account. Simulations using real DNN models show that, compared to the state-of-the-art electrical networks, our approach improves training time by 1.3--9.1x.\",\"PeriodicalId\":20487,\"journal\":{\"name\":\"Proceedings of the 2021 ACM SIGCOMM 2021 Conference\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-08-09\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"48\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 2021 ACM SIGCOMM 2021 Conference\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3452296.3472900\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2021 ACM SIGCOMM 2021 Conference","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3452296.3472900","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 48

摘要

本文提出光网络互连是构建具有强扩展特性的高带宽机器学习训练集群的关键促成因素。我们的设计,称为SiP-ML,使用能够为每个GPU提供每秒数太比特带宽的硅光子链路,加速流行DNN模型的训练时间。SiP-ML通过混合数据和模型并行性将训练任务跨gpu进行分区,同时保证在网络互连上有效地支持通信模式。我们开发了考虑光互连延迟程度和重构延迟的任务划分和器件放置方法。使用真实DNN模型的仿真表明,与最先进的电子网络相比,我们的方法将训练时间提高了1.3- 9.1倍。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
SiP-ML: high-bandwidth optical network interconnects for machine learning training
This paper proposes optical network interconnects as a key enabler for building high-bandwidth ML training clusters with strong scaling properties. Our design, called SiP-ML, accelerates the training time of popular DNN models using silicon photonics links capable of providing multiple terabits-per-second of bandwidth per GPU. SiP-ML partitions the training job across GPUs with hybrid data and model parallelism while ensuring the communication pattern can be supported efficiently on the network interconnect. We develop task partitioning and device placement methods that take the degree and reconfiguration latency of optical interconnects into account. Simulations using real DNN models show that, compared to the state-of-the-art electrical networks, our approach improves training time by 1.3--9.1x.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信