SiP-ML: high-bandwidth optical network interconnects for machine learning training
Mehrdad Khani Shirkoohi, M. Ghobadi, M. Alizadeh, Ziyi Zhu, M. Glick, K. Bergman, A. Vahdat, Benjamin Klenk, Eiman Ebrahimi
Proceedings of the ACM SIGCOMM 2021 Conference, August 9, 2021. DOI: 10.1145/3452296.3472900
Citations: 48
Abstract
This paper proposes optical network interconnects as a key enabler for building high-bandwidth ML training clusters with strong scaling properties. Our design, called SiP-ML, accelerates the training of popular DNN models using silicon photonics links capable of providing multiple terabits per second of bandwidth per GPU. SiP-ML partitions the training job across GPUs with hybrid data and model parallelism while ensuring that the communication pattern can be supported efficiently on the network interconnect. We develop task partitioning and device placement methods that take the degree and reconfiguration latency of optical interconnects into account. Simulations using real DNN models show that, compared to state-of-the-art electrical networks, our approach improves training time by 1.3-9.1x.
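To make the two constraints the abstract highlights concrete, the sketch below illustrates (a) why the degree (number of optical circuits a GPU can hold at once) bounds how parallel groups can be placed, and (b) when a transfer is large enough to amortize the circuit reconfiguration latency. This is a minimal, hypothetical Python illustration of those two ideas, not the paper's actual partitioning or placement algorithm; the group sizes, ring topology, per-GPU bandwidth, and latency values are all illustrative assumptions.

```python
# Hypothetical sketch of the degree and reconfiguration-latency constraints
# described in the abstract. Not SiP-ML's actual algorithm; all parameter
# values below are illustrative assumptions.

def worth_reconfiguring(bytes_to_send: float,
                        link_bandwidth_bps: float,
                        reconfig_latency_s: float) -> bool:
    """Set up a new optical circuit only if the transfer is long enough
    to amortize the switching latency (a common rule of thumb for
    reconfigurable optical interconnects)."""
    transfer_time_s = 8 * bytes_to_send / link_bandwidth_bps
    return transfer_time_s > reconfig_latency_s


def place_groups(num_gpus: int, group_size: int, degree: int):
    """Assign GPUs to model-parallel groups of `group_size` contiguous
    devices, arranged as a ring within each group, and check that each
    GPU's circuit count stays within its optical port count (`degree`)."""
    assert num_gpus % group_size == 0
    groups = [list(range(g, g + group_size))
              for g in range(0, num_gpus, group_size)]
    # In a ring, each GPU needs 2 circuits inside its group, plus one
    # more for the data-parallel all-reduce across groups (if any).
    circuits_per_gpu = 2 + (1 if len(groups) > 1 else 0)
    assert circuits_per_gpu <= degree, "placement exceeds optical degree"
    return groups


if __name__ == "__main__":
    # 16 GPUs in model-parallel groups of 4, with 4 optical ports each.
    print(place_groups(num_gpus=16, group_size=4, degree=4))
    # A 1 GB transfer over a ~1 Tbps link vs. a 25 us reconfiguration:
    # the 8 ms transfer easily amortizes the switch, so reconfigure.
    print(worth_reconfiguring(bytes_to_send=1e9,
                              link_bandwidth_bps=1e12,
                              reconfig_latency_s=25e-6))
```

The point of the sketch is that the placement method and the reconfiguration policy interact: a larger degree lets more communication patterns be served by standing circuits, while a lower reconfiguration latency makes it cheaper to time-share ports across patterns that do not fit.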