Optimality of Fundamental Parallel Algorithms on the Hierarchical Memory Machine, with GPU Implementation

2015 23rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing. Pub Date: 2015-03-04. DOI: 10.1109/PDP.2015.46

K. Nakano, Yasuaki Ito


Citations: 1

Abstract

The Hierarchical Memory Machine (HMM) is a theoretical parallel computing model that captures the essence of the CUDA-enabled GPU architecture. It has multiple streaming multiprocessors, each with a shared memory, and a global memory that can be accessed by all threads. The HMM has several parameters: the number $d$ of streaming multiprocessors, the number $p$ of threads per streaming multiprocessor, the number $w$ of memory banks of each shared memory and of the global memory, the shared memory latency $l$, and the global memory latency $L$. The main purpose of this paper is to discuss the optimality of fundamental parallel algorithms running on the HMM. We first show that image convolution for an image with $n \times n$ pixels using a filter of size $(2v+1) \times (2v+1)$ can be done in $O\left(\frac{n^2}{w} + \frac{n^2 L}{dp} + \frac{n^2 v^2}{dw} + \frac{n^2 v^2 l}{dp}\right)$ time units on the HMM. Further, we show that this parallel implementation is time optimal by proving a lower bound on the running time. We then show that the product of two $n \times n$ matrices can be computed in $O\left(\frac{n^3}{mw} + \frac{n^3 L}{mdp} + \frac{n^3}{dw} + \frac{n^3 l}{dp}\right)$ time units on the HMM if the capacity of the shared memory in each streaming multiprocessor is $O(m^2)$. This implementation is also proved to be time optimal. We further clarify the conditions under which image convolution and matrix multiplication hide the memory access latency overhead and maximize the global memory throughput and the parallelism. Finally, we provide experimental results on a GeForce GTX Titan to support our theoretical analysis.
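To make the two bounds easier to read, the following minimal Python sketch evaluates each of their four terms with the hidden constants dropped. The class `HMM`, the function names, and the term labels (`global_bandwidth`, `global_latency`, and so on) are illustrative assumptions, not notation from the paper. The check in `latency_hidden` is one reading of the bounds, in which each latency term is no larger than the corresponding bandwidth term (equivalent to $dp \ge wL$ and $p \ge wl$); it is not necessarily the exact condition the paper derives.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HMM:
    """HMM parameters as named in the abstract."""
    d: int  # number of streaming multiprocessors
    p: int  # threads per streaming multiprocessor
    w: int  # memory banks of each shared memory and of the global memory
    l: int  # shared memory latency
    L: int  # global memory latency


def convolution_terms(hmm, n, v):
    """Terms of O(n^2/w + n^2 L/dp + n^2 v^2/dw + n^2 v^2 l/dp)."""
    return {
        "global_bandwidth": n**2 / hmm.w,
        "global_latency": n**2 * hmm.L / (hmm.d * hmm.p),
        "shared_bandwidth": n**2 * v**2 / (hmm.d * hmm.w),
        "shared_latency": n**2 * v**2 * hmm.l / (hmm.d * hmm.p),
    }


def matmul_terms(hmm, n, m):
    """Terms of O(n^3/mw + n^3 L/mdp + n^3/dw + n^3 l/dp)."""
    return {
        "global_bandwidth": n**3 / (m * hmm.w),
        "global_latency": n**3 * hmm.L / (m * hmm.d * hmm.p),
        "shared_bandwidth": n**3 / (hmm.d * hmm.w),
        "shared_latency": n**3 * hmm.l / (hmm.d * hmm.p),
    }


def latency_hidden(terms):
    """Assumed reading: latency terms do not exceed bandwidth terms."""
    return (terms["global_latency"] <= terms["global_bandwidth"]
            and terms["shared_latency"] <= terms["shared_bandwidth"])
```

For example, calling `latency_hidden(convolution_terms(HMM(d=14, p=1024, w=32, l=8, L=400), n=1024, v=2))` checks the condition for one hypothetical parameter setting. These values are chosen only for illustration and are not measurements from the GeForce GTX Titan experiments.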


