SDCC: software-defined collective communication for distributed training
Xin Jin, Zhen Zhang, Yunshan Jia, Yun Ma, Xuanzhe Liu
Science China Information Sciences, published 2024-07-31. DOI: 10.1007/s11432-023-3894-4
Communication is crucial to the performance of distributed training, yet today’s solutions tightly couple the control and data planes and therefore fall short in flexibility, generality, and performance. In this study, we present SDCC, a software-defined collective communication framework for distributed training. SDCC follows the modern systems-design principle of decoupling the control plane from the data plane. It abstracts the collective communication operations in distributed training as dataflow operations and unifies computation and communication in a single dataflow graph. This abstraction and unification are powerful: they let users easily express new and existing collective communication algorithms and optimizations, simplify integration with different computing engines (e.g., PyTorch and TensorFlow) and network transports (e.g., Linux TCP and kernel bypass), and allow the system to improve performance by exploiting the parallelism exposed by the dataflow graph. We further demonstrate the benefits of SDCC in four use cases.
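The abstract does not show SDCC's actual API, so the sketch below is only illustrative: the names Op, DataflowGraph, and build_ring_allreduce are hypothetical, and each "communication" op here simply simulates a transfer between in-memory workers. What it does illustrate is the idea the abstract describes: a collective (here, a ring allreduce) expressed as communication and reduction nodes in a single dataflow graph, whose explicit dependencies expose the parallelism a runtime could exploit.

```python
"""Minimal illustrative sketch (not SDCC's real API): a ring allreduce
expressed as ops in a single dataflow graph, executed on simulated workers."""
from collections import defaultdict, deque

class Op:
    """A node in the dataflow graph: a 'comm' or 'compute' operation."""
    def __init__(self, name, kind, fn, deps=()):
        self.name, self.kind, self.fn, self.deps = name, kind, fn, list(deps)

class DataflowGraph:
    def __init__(self):
        self.ops = []

    def add(self, op):
        self.ops.append(op)
        return op

    def run(self):
        """Execute ops in a topological order. A real runtime would run
        independent ops in parallel and overlap communication with compute."""
        indeg = {op: len(op.deps) for op in self.ops}
        children = defaultdict(list)
        for op in self.ops:
            for d in op.deps:
                children[d].append(op)
        ready = deque(op for op in self.ops if indeg[op] == 0)
        while ready:
            op = ready.popleft()
            op.fn()
            for c in children[op]:
                indeg[c] -= 1
                if indeg[c] == 0:
                    ready.append(c)

def build_ring_allreduce(graph, buffers):
    """Emit a ring allreduce over `buffers` (one list of chunks per worker)
    as comm/reduce ops in the shared dataflow graph."""
    n = len(buffers)
    last = [None] * n  # last op that touched each worker, for ordering deps

    def dep(*workers):
        return [last[w] for w in workers if last[w] is not None]

    # Reduce-scatter: each step, worker i sends one chunk to worker i+1,
    # which reduces (adds) it into its own copy of that chunk.
    for step in range(n - 1):
        new_last = list(last)
        for i in range(n):
            src, dst, c = i, (i + 1) % n, (i - step) % n
            def reduce_chunk(src=src, dst=dst, c=c):
                buffers[dst][c] += buffers[src][c]   # transfer + reduce in one op
            new_last[dst] = graph.add(
                Op(f"rs_s{step}_w{src}", "comm", reduce_chunk, dep(src, dst)))
        last = new_last

    # Allgather: circulate the fully reduced chunks around the ring.
    for step in range(n - 1):
        new_last = list(last)
        for i in range(n):
            src, dst, c = i, (i + 1) % n, (i + 1 - step) % n
            def copy_chunk(src=src, dst=dst, c=c):
                buffers[dst][c] = buffers[src][c]    # pure communication
            new_last[dst] = graph.add(
                Op(f"ag_s{step}_w{src}", "comm", copy_chunk, dep(src, dst)))
        last = new_last

if __name__ == "__main__":
    workers, chunks = 4, 4
    # Each simulated worker holds a "gradient" split into `chunks` chunks.
    buffers = [[float(w + 1)] * chunks for w in range(workers)]
    g = DataflowGraph()
    build_ring_allreduce(g, buffers)
    g.run()
    expected = float(sum(range(1, workers + 1)))  # 1 + 2 + 3 + 4 = 10
    assert all(chunk == expected for buf in buffers for chunk in buf)
    print("allreduce result per chunk:", expected)
```

Running the script completes the allreduce and asserts that every worker ends up with the global sum (10.0 per chunk for four workers holding values 1 through 4). Because each op depends only on the previous ops that touched its source and destination workers, the ops within a step are mutually independent, which is exactly the kind of parallelism a dataflow-graph runtime could schedule and overlap.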
Journal introduction:
Science China Information Sciences is a dedicated journal that showcases high-quality, original research across various domains of information sciences. It encompasses Computer Science & Technologies, Control Science & Engineering, Information & Communication Engineering, Microelectronics & Solid-State Electronics, and Quantum Information, providing a platform for the dissemination of significant contributions in these fields.