Dynamic control flow in large-scale machine learning

Yuan Yu, Martín Abadi, Paul Barham, Eugene Brevdo, Mike Burrows, Andy Davis, Jeff Dean, Sanjay Ghemawat, Tim Harley, Peter Hawkins, Michael Isard, Manjunath Kudlur, Rajat Monga, Derek Murray, Xiaoqiang Zheng
{"title":"Dynamic control flow in large-scale machine learning","authors":"Yuan Yu, Martín Abadi, P. Barham, E. Brevdo, M. Burrows, Andy Davis, J. Dean, S. Ghemawat, Tim Harley, Peter Hawkins, M. Isard, M. Kudlur, R. Monga, D. Murray, Xiaoqiang Zheng","doi":"10.1145/3190508.3190551","DOIUrl":null,"url":null,"abstract":"Many recent machine learning models rely on fine-grained dynamic control flow for training and inference. In particular, models based on recurrent neural networks and on reinforcement learning depend on recurrence relations, data-dependent conditional execution, and other features that call for dynamic control flow. These applications benefit from the ability to make rapid control-flow decisions across a set of computing devices in a distributed system. For performance, scalability, and expressiveness, a machine learning system must support dynamic control flow in distributed and heterogeneous environments. This paper presents a programming model for distributed machine learning that supports dynamic control flow. We describe the design of the programming model, and its implementation in TensorFlow, a distributed machine learning system. Our approach extends the use of dataflow graphs to represent machine learning models, offering several distinctive features. First, the branches of conditionals and bodies of loops can be partitioned across many machines to run on a set of heterogeneous devices, including CPUs, GPUs, and custom ASICs. Second, programs written in our model support automatic differentiation and distributed gradient computations, which are necessary for training machine learning models that use control flow. Third, our choice of non-strict semantics enables multiple loop iterations to execute in parallel across machines, and to overlap compute and I/O operations. We have done our work in the context of TensorFlow, and it has been used extensively in research and production. We evaluate it using several real-world applications, and demonstrate its performance and scalability.","PeriodicalId":334267,"journal":{"name":"Proceedings of the Thirteenth EuroSys Conference","volume":"1949 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-04-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"96","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the Thirteenth EuroSys Conference","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3190508.3190551","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 96

Abstract

Many recent machine learning models rely on fine-grained dynamic control flow for training and inference. In particular, models based on recurrent neural networks and on reinforcement learning depend on recurrence relations, data-dependent conditional execution, and other features that call for dynamic control flow. These applications benefit from the ability to make rapid control-flow decisions across a set of computing devices in a distributed system. For performance, scalability, and expressiveness, a machine learning system must support dynamic control flow in distributed and heterogeneous environments. This paper presents a programming model for distributed machine learning that supports dynamic control flow. We describe the design of the programming model, and its implementation in TensorFlow, a distributed machine learning system. Our approach extends the use of dataflow graphs to represent machine learning models, offering several distinctive features. First, the branches of conditionals and bodies of loops can be partitioned across many machines to run on a set of heterogeneous devices, including CPUs, GPUs, and custom ASICs. Second, programs written in our model support automatic differentiation and distributed gradient computations, which are necessary for training machine learning models that use control flow. Third, our choice of non-strict semantics enables multiple loop iterations to execute in parallel across machines, and to overlap compute and I/O operations. We have done our work in the context of TensorFlow, and it has been used extensively in research and production. We evaluate it using several real-world applications, and demonstrate its performance and scalability.
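As a concrete illustration of the control-flow constructs the abstract describes, the sketch below uses TensorFlow's public tf.cond and tf.while_loop operators together with automatic differentiation. It is a minimal example written against the TensorFlow 2.x API, not code from the paper; the function and variable names (loss, x, n) are illustrative.

```python
# A minimal sketch (not from the paper) of dynamic control flow in
# TensorFlow: tf.cond for data-dependent branching, tf.while_loop for
# recurrences, and automatic differentiation through both.
import tensorflow as tf

@tf.function
def loss(x, n):
    # Data-dependent conditional: which branch runs depends on the
    # runtime value of x, not on anything known at graph-construction time.
    y = tf.cond(x > 0.0, lambda: x * x, lambda: -x)

    # Recurrence expressed as a dataflow loop: iterations are graph
    # operations, so the runtime can dispatch them across devices and
    # overlap work between iterations, rather than unrolling in Python.
    i0 = tf.constant(0)
    _, y = tf.while_loop(
        cond=lambda i, acc: i < n,
        body=lambda i, acc: (i + 1, acc * 0.5 + 1.0),
        loop_vars=(i0, y),
    )
    return y

x = tf.Variable(3.0)
with tf.GradientTape() as tape:
    out = loss(x, tf.constant(4))
# Differentiation through the conditional and the loop is automatic:
# the system constructs gradient subgraphs for each control-flow construct.
print(out.numpy(), tape.gradient(out, x).numpy())  # 2.4375 0.375
```

The key point the example makes is that the branch and the loop body are part of the dataflow graph itself, which is what lets the system partition them across heterogeneous devices and differentiate through them, as the abstract claims.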