Ouroboros: virtualized queues for dynamic memory management on GPUs

Proceedings of the 34th ACM International Conference on Supercomputing. Pub Date: 2020-06-29. DOI: 10.1145/3392717.3392742

Martin Winter, Daniel Mlakar, Mathias Parger, M. Steinberger


Citations: 6

Abstract

Dynamic memory allocation on a single-instruction, multiple-thread (SIMT) architecture such as the Graphics Processing Unit (GPU) is challenging, and implementation guidelines caution against it. Data structures must cope with thousands of concurrently active threads trying to allocate memory. Efficient queueing structures have been used in the past to allow for simple allocation and reuse of memory directly on the GPU, but they do not scale well to different allocation sizes, as each size requires its own queue. In this work, we propose Ouroboros, a virtualized queueing structure that manages dynamically allocatable data chunks while itself being built on top of these same chunks. Data chunks are interpreted on the fly either as building blocks for the virtualized queues or as paged user data. Reusable user memory is managed in one of two ways: either as individual pages or as chunks containing pages. The queueing structures grow and shrink dynamically; only the queue chunks currently needed are held in memory, and freed queue chunks can be reused within the system. Thus, we retain the performance benefits of an efficient, static queue design while keeping the memory requirements low. A performance comparison on an NVIDIA TITAN V against the native device memory allocator in CUDA 10.1 shows speed-ups between 11X and 412X, with an average of 118X. For real-world testing, we integrate our allocator into faimGraph, a dynamic graph framework with proprietary memory management. Throughout all memory-intensive operations, such as graph initialization and edge updates, our allocator shows similar or improved performance. Additionally, we show improved algorithmic performance on PageRank and Static Triangle Counting. Overall, our memory allocator can be efficiently initialized, allows for high-throughput allocation, and offers, with its per-thread allocation model, a drop-in replacement for comparable dynamic memory allocators.
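The core idea of a queue that stores its own entries in chunks drawn from the pool it manages can be illustrated with a small host-side sketch. This is not the authors' implementation: it is single-threaded C++ rather than concurrent GPU code, and every name (`ChunkPool`, `VirtualQueue`, `kSlotsPerChunk`) and size is a hypothetical chosen for illustration. It only shows the grow-and-shrink behaviour the abstract describes, where a queue chunk is acquired when the tail crosses a chunk boundary and returned to the pool once the head has consumed it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Hypothetical sketch: sizes and names are assumptions, not taken from the paper.
constexpr std::uint32_t kSlotsPerChunk = 4;  // queue entries that fit in one chunk

// A pool of equally sized chunks. A chunk has no fixed role: the caller decides
// whether it holds queue entries or user data.
struct ChunkPool {
    std::vector<std::vector<std::uint32_t>> chunks;
    std::vector<std::uint32_t> free_list;

    std::uint32_t acquire() {
        if (!free_list.empty()) {            // reuse a previously freed chunk first
            std::uint32_t id = free_list.back();
            free_list.pop_back();
            return id;
        }
        chunks.emplace_back(kSlotsPerChunk, 0u);
        return static_cast<std::uint32_t>(chunks.size() - 1);
    }

    void release(std::uint32_t id) {
        assert(id < chunks.size());
        free_list.push_back(id);
    }

    std::size_t in_use() const { return chunks.size() - free_list.size(); }
};

// A FIFO queue whose storage lives in chunks taken from the same pool.
class VirtualQueue {
public:
    explicit VirtualQueue(ChunkPool& pool) : pool_(pool) {}

    void push(std::uint32_t value) {
        if (tail_ % kSlotsPerChunk == 0) {   // tail enters a new chunk: grow
            queue_chunks_.push_back(pool_.acquire());
        }
        pool_.chunks[queue_chunks_.back()][tail_ % kSlotsPerChunk] = value;
        ++tail_;
    }

    bool pop(std::uint32_t& out) {
        if (head_ == tail_) return false;    // queue is empty
        out = pool_.chunks[queue_chunks_.front()][head_ % kSlotsPerChunk];
        ++head_;
        if (head_ % kSlotsPerChunk == 0) {   // front chunk fully consumed: shrink
            pool_.release(queue_chunks_.front());
            queue_chunks_.pop_front();
        }
        return true;
    }

private:
    ChunkPool& pool_;
    std::deque<std::uint32_t> queue_chunks_; // chunk ids currently backing the queue
    std::uint64_t head_ = 0;                 // absolute index of the next entry to pop
    std::uint64_t tail_ = 0;                 // absolute index of the next free slot
};
```

In this sketch, a chunk released by the queue goes back onto the pool's free list, so a later `acquire()` can hand the same chunk out for a different purpose. A GPU version would additionally need atomic head and tail updates and a way to locate queue chunks without a host-side `std::deque`, which this example does not attempt to model.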
