Python programmers have GPUs too: automatic Python loop parallelization with staged dependence analysis

Proceedings of the 15th ACM SIGPLAN International Symposium on Dynamic Languages. Pub Date: 2019-10-20. DOI: 10.1145/3359619.3359743

D. Jacob, P. Trinder, Jeremy Singer


Citations: 5

Abstract

Python is a popular language for end-user software development in many application domains. End-users want to harness parallel compute resources effectively by exploiting commodity manycore technology, including GPUs. However, existing approaches to parallelism in Python are esoteric, and generally seem too complex for the typical end-user developer. We argue that implicit, or automatic, parallelization is the best way to deliver the benefits of manycore to end-users, since it avoids domain-specific languages, specialist libraries, complex annotations and restrictive language subsets. Auto-parallelization fits the Python philosophy, provides effective performance, and is convenient for non-expert developers. We show that Python, despite being a dynamic language, is a suitable target for auto-parallelization. In an empirical study of 3000+ open-source Python notebooks, we demonstrate that typical loop behaviour ‘in the wild’ is amenable to auto-parallelization. We show that staging the dependence analysis is an effective way to maximize performance. We first apply classical dependence analysis techniques, then leverage the Python runtime’s rich introspection capabilities to resolve additional loop bounds and variable types in a just-in-time manner. The parallel loop nest code is then converted to CUDA kernels for GPU execution. We achieve orders of magnitude speedup over baseline interpreted execution and some speedup (up to 50x, although not consistently) over CPU JIT-compiled execution, across 12 loop-intensive standard benchmarks.
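The staging idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`static_stage`, `runtime_stage`, `classify`) and the tiny loop model are hypothetical, chosen only for illustration. The sketch assumes Python 3.9 or later, loops of the form `for i in range(n)`, non-negative array indices, and that distinct names refer to distinct arrays (no aliasing). A first stage decides what it can from syntax alone; a second stage resolves the remaining cases once the loop bound and offset values are known at run time. It stops before any CUDA code generation.

```python
import ast

# Node types this sketch understands; anything else (calls, attribute
# access, nested loops, ...) makes the static stage give up conservatively.
_ALLOWED = (ast.Assign, ast.AugAssign, ast.Subscript, ast.Name, ast.BinOp,
            ast.Constant, ast.expr_context, ast.operator)


def _offset(sub, index):
    """Return ("const", c) for index + c, ("symbol", name) for index + name,
    or ("other", None) for any subscript this sketch does not model."""
    if isinstance(sub, ast.Name) and sub.id == index:
        return ("const", 0)
    if (isinstance(sub, ast.BinOp) and isinstance(sub.op, ast.Add)
            and isinstance(sub.left, ast.Name) and sub.left.id == index):
        right = sub.right
        if isinstance(right, ast.Constant) and isinstance(right.value, int):
            return ("const", right.value)
        if isinstance(right, ast.Name) and right.id != index:
            return ("symbol", right.id)
    return ("other", None)


def static_stage(source):
    """Stage 1: classify `for i in range(n): ...` from syntax alone.

    Returns (verdict, bound, symbol). The verdict is "parallel",
    "sequential" or "unknown"; bound and symbol name the values that
    stage 2 must look up at run time.
    """
    give_up = ("sequential", None, None)
    loop = ast.parse(source).body[0]
    it = getattr(loop, "iter", None)
    if not (isinstance(loop, ast.For) and isinstance(loop.target, ast.Name)
            and not loop.orelse and isinstance(it, ast.Call)
            and isinstance(it.func, ast.Name) and it.func.id == "range"
            and len(it.args) == 1 and not it.keywords
            and isinstance(it.args[0], ast.Name)):
        return give_up
    index, bound = loop.target.id, it.args[0].id
    nodes = [n for stmt in loop.body for n in ast.walk(stmt)]
    subs = [n for n in nodes if isinstance(n, ast.Subscript)]
    # Scalar writes (such as reductions) and nested subscripts are not modelled.
    scalar_write = any(isinstance(n, ast.Name) and isinstance(n.ctx, ast.Store)
                       for n in nodes)
    if (scalar_write or not all(isinstance(n, _ALLOWED) for n in nodes)
            or not all(isinstance(s.value, ast.Name) for s in subs)):
        return give_up
    written = {s.value.id for s in subs if isinstance(s.ctx, ast.Store)}
    symbols = set()
    for s in subs:
        if s.value.id not in written:
            continue  # read-only arrays cannot carry a dependence
        kind, value = _offset(s.slice, index)
        if kind == "other" or (kind == "const" and value != 0):
            return give_up
        if kind == "symbol":
            symbols.add(value)
    if len(symbols) > 1:
        return give_up
    if symbols:
        return ("unknown", bound, symbols.pop())
    return ("parallel", bound, None)


def runtime_stage(bound, symbol, env):
    """Stage 2: settle an "unknown" verdict using live values from `env`."""
    n, k = env[bound], env[symbol]
    # Offsets 0 and k touch [0, n) and [k, n + k). They coincide when k == 0
    # and are disjoint when k >= n; anything else (including negative or
    # non-integer k) is treated as a loop-carried dependence.
    if isinstance(n, int) and isinstance(k, int) and (k == 0 or k >= n):
        return "parallel"
    return "sequential"


def classify(source, env):
    """Run stage 1, and fall back to stage 2 only when stage 1 cannot decide."""
    verdict, bound, symbol = static_stage(source)
    if verdict == "unknown":
        verdict = runtime_stage(bound, symbol, env)
    return verdict


print(classify("for i in range(n): b[i] = a[i] * 2", {}))                 # parallel
print(classify("for i in range(n): a[i] = a[i + 1]", {}))                 # sequential
print(classify("for i in range(n): a[i] = a[i + k]", {"n": 8, "k": 8}))   # parallel
print(classify("for i in range(n): a[i] = a[i + k]", {"n": 8, "k": 3}))   # sequential
```

The last two calls show why staging can help in this toy model: the same loop text cannot be classified statically, but becomes provably independent once the run-time values of `n` and `k` are inspected. A real system would also need alias checks and type information before generating a GPU kernel.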

