BigKernel -- High Performance CPU-GPU Communication Pipelining for Big Data-Style Applications

2014 IEEE 28th International Parallel and Distributed Processing Symposium Pub Date : 2014-05-19 DOI:10.1109/IPDPS.2014.89

Reza Mokhtari, M. Stumm

{"title":"BigKernel -- High Performance CPU-GPU Communication Pipelining for Big Data-Style Applications","authors":"Reza Mokhtari, M. Stumm","doi":"10.1109/IPDPS.2014.89","DOIUrl":null,"url":null,"abstract":"GPUs offer an order of magnitude higher compute power and memory bandwidth than CPUs. GPUs therefore might appear to be well suited to accelerate computations that operate on voluminous data sets in independent ways, e.g., for transformations, filtering, aggregation, partitioning or other \"Big Data\" style processing. Yet experience indicates that it is difficult, and often error-prone, to write GPGPU programs which efficiently process data that does not fit in GPU memory, partly because of the intricacies of GPU hardware architecture and programming models, and partly because of the limited bandwidth available between GPUs and CPUs. In this paper, we propose Big Kernel, a scheme that provides pseudo-virtual memory to GPU applications and is implemented using a 4-stage pipeline with automated prefetching to (i) optimize CPU-GPU communication and (ii) optimize GPU memory accesses. Big Kernel simplifies the programming model by allowing programmers to write kernels using arbitrarily large data structures that can be partitioned into segments where each segment is operated on independently, these kernels are transformed into Big Kernel using straight-forward compiler transformations. Our evaluation on six data-intensive benchmarks shows that Big Kernel achieves an average speedup of 1.7 over state-of-the-art double-buffering techniques and an average speedup of 3.0 over corresponding multi-threaded CPU implementations.","PeriodicalId":309291,"journal":{"name":"2014 IEEE 28th International Parallel and Distributed Processing Symposium","volume":"19 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"25","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2014 IEEE 28th International Parallel and Distributed Processing Symposium","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IPDPS.2014.89","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 25

Abstract

GPUs offer an order of magnitude higher compute power and memory bandwidth than CPUs. GPUs therefore might appear to be well suited to accelerate computations that operate on voluminous data sets in independent ways, e.g., for transformations, filtering, aggregation, partitioning or other "Big Data" style processing. Yet experience indicates that it is difficult, and often error-prone, to write GPGPU programs which efficiently process data that does not fit in GPU memory, partly because of the intricacies of GPU hardware architecture and programming models, and partly because of the limited bandwidth available between GPUs and CPUs. In this paper, we propose Big Kernel, a scheme that provides pseudo-virtual memory to GPU applications and is implemented using a 4-stage pipeline with automated prefetching to (i) optimize CPU-GPU communication and (ii) optimize GPU memory accesses. Big Kernel simplifies the programming model by allowing programmers to write kernels using arbitrarily large data structures that can be partitioned into segments where each segment is operated on independently, these kernels are transformed into Big Kernel using straight-forward compiler transformations. Our evaluation on six data-intensive benchmarks shows that Big Kernel achieves an average speedup of 1.7 over state-of-the-art double-buffering techniques and an average speedup of 3.0 over corresponding multi-threaded CPU implementations.

查看原文本刊更多论文

BigKernel——面向大数据风格应用的高性能CPU-GPU通信管道

gpu提供的计算能力和内存带宽比cpu高一个数量级。因此，gpu似乎非常适合加速以独立方式对大量数据集进行操作的计算，例如，用于转换、过滤、聚合、分区或其他“大数据”风格的处理。然而，经验表明，编写GPGPU程序来有效地处理不适合GPU内存的数据是很困难的，而且经常容易出错，部分原因是GPU硬件架构和编程模型的复杂性，部分原因是GPU和cpu之间可用的带宽有限。在本文中，我们提出了一个大内核方案，该方案为GPU应用程序提供伪虚拟内存，并使用带有自动预取的4阶段管道来实现(i)优化CPU-GPU通信和(ii)优化GPU内存访问。Big Kernel简化了编程模型，它允许程序员使用任意大的数据结构来编写内核，这些数据结构可以被分割成段，每个段都是独立操作的，这些内核使用直接的编译器转换转换成Big Kernel。我们对六个数据密集型基准测试的评估表明，与最先进的双缓冲技术相比，Big Kernel的平均加速提高了1.7，与相应的多线程CPU实现相比，平均加速提高了3.0。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

2014 IEEE 28th International Parallel and Distributed Processing Symposium

自引率

0.00%

发文量