A Flexible Research-Oriented Framework for Distributed Training of Deep Neural Networks

S. Barrachina, Adrián Castelló, M. Catalán, M. F. Dolz, José I. Mestre
{"title":"面向研究的深度神经网络分布式训练灵活框架","authors":"S. Barrachina, Adrián Castelló, M. Catalán, M. F. Dolz, José I. Mestre","doi":"10.1109/IPDPSW52791.2021.00110","DOIUrl":null,"url":null,"abstract":"We present PyDTNN, a framework for training deep neural networks (DNNs) on clusters of computers that has been designed as a research-oriented tool with a low learning curve. Our parallel training framework offers a set of functionalities that cover several must-have features for advanced deep learning (DL) software: 1) it is developed in Python in order to expose an accessible entry point for the newcomer; 2) it is extensible, allowing users to prototype new research ideas without requiring them to deal with complex software-stacks; and 3) it delivers high parallel performance, exploiting MPI via mpi4py/NCCL for communication; and NumPy, cuDNN, and cuBLAS for computation.This paper provides practical evidence that PyDTNN attains similar accuracy and parallel performance to those exhibited by Google’s TensorFlow (TF), though we recognize that PyDTNN cannot compete with a production-level framework such as TF or PyTorch in terms of maturity and functionality. Instead, PyDTNN is designed as an accessible and customizable tool for prototyping ideas related to distributed training of DNN models on clusters.","PeriodicalId":170832,"journal":{"name":"2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"51 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":"{\"title\":\"A Flexible Research-Oriented Framework for Distributed Training of Deep Neural Networks\",\"authors\":\"S. Barrachina, Adrián Castelló, M. Catalán, M. F. Dolz, José I. Mestre\",\"doi\":\"10.1109/IPDPSW52791.2021.00110\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"We present PyDTNN, a framework for training deep neural networks (DNNs) on clusters of computers that has been designed as a research-oriented tool with a low learning curve. Our parallel training framework offers a set of functionalities that cover several must-have features for advanced deep learning (DL) software: 1) it is developed in Python in order to expose an accessible entry point for the newcomer; 2) it is extensible, allowing users to prototype new research ideas without requiring them to deal with complex software-stacks; and 3) it delivers high parallel performance, exploiting MPI via mpi4py/NCCL for communication; and NumPy, cuDNN, and cuBLAS for computation.This paper provides practical evidence that PyDTNN attains similar accuracy and parallel performance to those exhibited by Google’s TensorFlow (TF), though we recognize that PyDTNN cannot compete with a production-level framework such as TF or PyTorch in terms of maturity and functionality. 
Instead, PyDTNN is designed as an accessible and customizable tool for prototyping ideas related to distributed training of DNN models on clusters.\",\"PeriodicalId\":170832,\"journal\":{\"name\":\"2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)\",\"volume\":\"51 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-06-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"5\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/IPDPSW52791.2021.00110\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IPDPSW52791.2021.00110","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 5

Abstract

We present PyDTNN, a framework for training deep neural networks (DNNs) on clusters of computers that has been designed as a research-oriented tool with a low learning curve. Our parallel training framework offers a set of functionalities that cover several must-have features for advanced deep learning (DL) software: 1) it is developed in Python in order to expose an accessible entry point for newcomers; 2) it is extensible, allowing users to prototype new research ideas without having to deal with complex software stacks; and 3) it delivers high parallel performance, exploiting MPI via mpi4py/NCCL for communication, and NumPy, cuDNN, and cuBLAS for computation. This paper provides practical evidence that PyDTNN attains accuracy and parallel performance similar to those exhibited by Google's TensorFlow (TF), though we recognize that PyDTNN cannot compete with a production-level framework such as TF or PyTorch in terms of maturity and functionality. Instead, PyDTNN is designed as an accessible and customizable tool for prototyping ideas related to the distributed training of DNN models on clusters.
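
The abstract does not spell out PyDTNN's communication scheme beyond "MPI via mpi4py/NCCL", but data-parallel training over MPI typically averages per-layer gradients with an allreduce after each backward pass. Below is a minimal sketch of that pattern using mpi4py and NumPy; it is an illustration under that assumption, not PyDTNN's actual API, and the gradient array is a hypothetical stand-in for a real layer's gradient.

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

def allreduce_average(grad):
    """Sum a gradient array across all MPI ranks, then divide by the
    number of ranks so every worker applies the same averaged update."""
    out = np.empty_like(grad)
    comm.Allreduce(grad, out, op=MPI.SUM)  # blocking sum-reduce over all ranks
    return out / comm.Get_size()

# Each rank computes gradients on its own shard of the mini-batch
# (placeholder values here), then synchronizes before the weight update.
local_grad = np.random.rand(1024).astype(np.float32)
avg_grad = allreduce_average(local_grad)

Launched with, e.g., "mpirun -np 4 python train_sketch.py", every one of the four ranks holds identical averaged gradients after the call, which keeps the model replicas in lockstep.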