Reasoning About Foreign Function Interfaces Without Modelling the Foreign Language

Alexi Turcotte, Ellen Arteca, G. Richards
DOI: 10.4230/LIPIcs.ECOOP.2019.16 (https://doi.org/10.4230/LIPIcs.ECOOP.2019.16)
Venue: European Conference on Object-Oriented Programming (ECOOP)
Publication date: 2018-10-28
Citations: 11

Abstract

Object-oriented programming has long been regarded as too inefficient for SIMD high-performance computing, despite the fact that many important HPC applications have an inherent object structure. On SIMD accelerators, including GPUs, this is mainly due to performance problems with memory allocation and memory access: there are a few libraries that support parallel memory allocation directly on accelerator devices, but all of them suffer from uncoalesced memory accesses.

We discovered a broad class of object-oriented programs with many important real-world applications that can be implemented efficiently on massively parallel SIMD accelerators. We call this class Single-Method Multiple-Objects (SMMO), because parallelism is expressed by running a method on all objects of a type.

To make fast GPU programming available to average programmers, we developed DynaSOAr, a CUDA framework for SMMO applications. DynaSOAr consists of (1) a fully-parallel, lock-free, dynamic memory allocator, (2) a data layout DSL, and (3) an efficient, parallel do-all operation. DynaSOAr achieves performance superior to state-of-the-art GPU memory allocators by controlling both memory allocation and memory access.

DynaSOAr improves the usage of allocated memory with a Structure of Arrays data layout and achieves low memory fragmentation through efficient management of free and allocated memory blocks with lock-free, hierarchical bitmaps. In contrast to other allocators, our design is heavily based on atomic operations, trading raw (de)allocation performance for better overall application performance. In our benchmarks, DynaSOAr achieves a speedup of application code of up to 3x over state-of-the-art allocators. Moreover, DynaSOAr manages heap memory more efficiently than other allocators, allowing programmers to run up to 2x larger problem sizes with the same amount of memory.