用于TensorFlow深度学习模型分布式训练的自动代码转换

IF 1.4 4区计算机科学 Q3 COMPUTER SCIENCE, SOFTWARE ENGINEERING

Science of Computer Programming Pub Date : 2024-12-27 DOI:10.1016/j.scico.2024.103260

Yusung Sim , Wonho Shin , Sungho Lee

{"title":"用于TensorFlow深度学习模型分布式训练的自动代码转换","authors":"Yusung Sim , Wonho Shin , Sungho Lee","doi":"10.1016/j.scico.2024.103260","DOIUrl":null,"url":null,"abstract":"<div><div>Distributed training of deep learning models reduces training time by parallelizing training workloads across multiple GPUs. Distributed training frameworks, such as Horovod and DeepSpeed, provide APIs, and model engineers rewrite deep learning models using the APIs to parallelize their training. However, the rewriting is time-consuming and labor-intensive because it requires engineers to read and understand documents and examples of the frameworks as well as manual efforts to rewrite code.</div><div>In this paper, we propose an automated code transformation approach that transforms TensorFlow deep learning models designed for non-distributed training to models training on multiple GPUs with the Horovod framework. We closely inspect the Horovod document and code examples and identify four common training patterns of TensorFlow deep learning models. Then, we formalize code transformation rules for each training pattern. Using the rules, we implement an automated code transformation tool that takes a TensorFlow deep learning model written in Python and rewrites it with the Horovod APIs for distributed training. Through source-code level transformation, our approach enables developers to efficiently scale existing DL models to multiple GPUs. Our evaluation shows that the tool correctly transforms 15 out of 16 open-source TensorFlow deep learning models. To the best of our knowledge, our work is the first automatic transformation technique for distributing existing TensorFlow deep learning models at the source code level. We believe that our approach significantly reduces manual efforts to parallelize training of existing TensorFlow deep learning models.</div></div>","PeriodicalId":49561,"journal":{"name":"Science of Computer Programming","volume":"242 ","pages":"Article 103260"},"PeriodicalIF":1.4000,"publicationDate":"2024-12-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Automated code transformation for distributed training of TensorFlow deep learning models\",\"authors\":\"Yusung Sim , Wonho Shin , Sungho Lee\",\"doi\":\"10.1016/j.scico.2024.103260\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Distributed training of deep learning models reduces training time by parallelizing training workloads across multiple GPUs. Distributed training frameworks, such as Horovod and DeepSpeed, provide APIs, and model engineers rewrite deep learning models using the APIs to parallelize their training. However, the rewriting is time-consuming and labor-intensive because it requires engineers to read and understand documents and examples of the frameworks as well as manual efforts to rewrite code.</div><div>In this paper, we propose an automated code transformation approach that transforms TensorFlow deep learning models designed for non-distributed training to models training on multiple GPUs with the Horovod framework. We closely inspect the Horovod document and code examples and identify four common training patterns of TensorFlow deep learning models. Then, we formalize code transformation rules for each training pattern. Using the rules, we implement an automated code transformation tool that takes a TensorFlow deep learning model written in Python and rewrites it with the Horovod APIs for distributed training. Through source-code level transformation, our approach enables developers to efficiently scale existing DL models to multiple GPUs. Our evaluation shows that the tool correctly transforms 15 out of 16 open-source TensorFlow deep learning models. To the best of our knowledge, our work is the first automatic transformation technique for distributing existing TensorFlow deep learning models at the source code level. We believe that our approach significantly reduces manual efforts to parallelize training of existing TensorFlow deep learning models.</div></div>\",\"PeriodicalId\":49561,\"journal\":{\"name\":\"Science of Computer Programming\",\"volume\":\"242 \",\"pages\":\"Article 103260\"},\"PeriodicalIF\":1.4000,\"publicationDate\":\"2024-12-27\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Science of Computer Programming\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0167642324001837\",\"RegionNum\":4,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"COMPUTER SCIENCE, SOFTWARE ENGINEERING\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Science of Computer Programming","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0167642324001837","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, SOFTWARE ENGINEERING","Score":null,"Total":0}

引用次数: 0

摘要

深度学习模型的分布式训练通过在多个gpu上并行化训练工作负载来减少训练时间。分布式训练框架，如Horovod和DeepSpeed，提供api，模型工程师使用api重写深度学习模型以并行化他们的训练。然而，重写是耗时和劳动密集型的，因为它要求工程师阅读和理解框架的文档和示例，以及手动重写代码。在本文中，我们提出了一种自动代码转换方法，该方法将为非分布式训练设计的TensorFlow深度学习模型转换为使用Horovod框架在多个gpu上训练的模型。我们仔细检查了Horovod文档和代码示例，并确定了TensorFlow深度学习模型的四种常见训练模式。然后，我们形式化了每个训练模式的代码转换规则。使用这些规则，我们实现了一个自动代码转换工具，该工具采用用Python编写的TensorFlow深度学习模型，并使用Horovod api重写它以进行分布式训练。通过源代码级转换，我们的方法使开发人员能够有效地将现有DL模型扩展到多个gpu。我们的评估表明，该工具正确地转换了16个开源TensorFlow深度学习模型中的15个。据我们所知，我们的工作是第一个在源代码级别分发现有TensorFlow深度学习模型的自动转换技术。我们认为，我们的方法显著减少了对现有TensorFlow深度学习模型进行并行训练的人工工作量。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Automated code transformation for distributed training of TensorFlow deep learning models

Distributed training of deep learning models reduces training time by parallelizing training workloads across multiple GPUs. Distributed training frameworks, such as Horovod and DeepSpeed, provide APIs, and model engineers rewrite deep learning models using the APIs to parallelize their training. However, the rewriting is time-consuming and labor-intensive because it requires engineers to read and understand documents and examples of the frameworks as well as manual efforts to rewrite code.

In this paper, we propose an automated code transformation approach that transforms TensorFlow deep learning models designed for non-distributed training to models training on multiple GPUs with the Horovod framework. We closely inspect the Horovod document and code examples and identify four common training patterns of TensorFlow deep learning models. Then, we formalize code transformation rules for each training pattern. Using the rules, we implement an automated code transformation tool that takes a TensorFlow deep learning model written in Python and rewrites it with the Horovod APIs for distributed training. Through source-code level transformation, our approach enables developers to efficiently scale existing DL models to multiple GPUs. Our evaluation shows that the tool correctly transforms 15 out of 16 open-source TensorFlow deep learning models. To the best of our knowledge, our work is the first automatic transformation technique for distributing existing TensorFlow deep learning models at the source code level. We believe that our approach significantly reduces manual efforts to parallelize training of existing TensorFlow deep learning models.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Science of Computer Programming 工程技术-计算机：软件工程

CiteScore

3.80

自引率

0.00%

发文量

审稿时长

67 days

期刊介绍： Science of Computer Programming is dedicated to the distribution of research results in the areas of software systems development, use and maintenance, including the software aspects of hardware design. The journal has a wide scope ranging from the many facets of methodological foundations to the details of technical issues andthe aspects of industrial practice. The subjects of interest to SCP cover the entire spectrum of methods for the entire life cycle of software systems, including • Requirements, specification, design, validation, verification, coding, testing, maintenance, metrics and renovation of software; • Design, implementation and evaluation of programming languages; • Programming environments, development tools, visualisation and animation; • Management of the development process; • Human factors in software, software for social interaction, software for social computing; • Cyber physical systems, and software for the interaction between the physical and the machine; • Software aspects of infrastructure services, system administration, and network management.