Asynchronous Memory Machine Models with Barrier Synchronization

2012 Third International Conference on Networking and Computing. Pub Date: 2012-12-05. DOI: 10.1109/ICNC.2012.18

K. Nakano


Citations: 11

Abstract



The Discrete Memory Machine (DMM) and the Unified Memory Machine (UMM) are theoretical parallel computing models that capture the essence of the shared memory and the global memory of GPUs, respectively. Earlier work assumed that warps (i.e., groups of threads) on the DMM and the UMM work synchronously in a round-robin manner. On actual GPUs, however, warps work asynchronously, in the sense that they may be dispatched for execution in a random (or arbitrary) order. The first contribution of this paper is to introduce asynchronous versions of the DMM and the UMM, in which warps are arbitrarily dispatched. In place of the synchronous round-robin assumption, we assume that threads can execute the “syncthreads” instruction for barrier synchronization. Since barrier synchronization is costly, the number of barrier synchronization operations performed by a parallel algorithm should be evaluated and minimized. The second contribution is a parallel algorithm that computes the sum of n numbers in optimal computing time with few barrier synchronization steps. The algorithm computes the sum in O(n/w + l log n) time units and O(log l/w + log log w) barrier synchronization steps using wl threads, on both the asynchronous DMM and the asynchronous UMM with width w and latency l. We also prove that the computing time is optimal because it matches the theoretical lower bound. Quite surprisingly, the number of barrier synchronization steps and the number of threads are independent of n: however large the input size n is, the algorithm computes the sum in optimal time using a number of syncthreads executions and a number of threads that depend only on w and l.
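To make the claim about n-independence concrete, here is a minimal Python sketch that simulates a simple two-phase sum with wl threads and counts barriers. It is not the algorithm from the paper: it uses a plain binary-tree reduction, so its barrier count grows like log(wl) rather than the O(log l/w + log log w) bound stated above, and it does not model memory access time at all. The function name `two_phase_sum` and the barrier-counting convention are assumptions made for illustration.

```python
def two_phase_sum(values, w, l):
    """Simulate a two-phase parallel sum with p = w * l threads.

    Returns (total, barriers). Illustrative only: a binary-tree
    reduction, not the barrier-optimal scheme of the paper.
    """
    p = w * l  # number of simulated threads, fixed by the machine, not by n

    # Phase 1: thread t sums values[t], values[t + p], values[t + 2p], ...
    # Threads touch disjoint elements, so no barrier is needed inside this phase.
    partial = [sum(values[t::p]) for t in range(p)]
    barriers = 1  # one barrier so every partial sum is visible before phase 2

    # Phase 2: binary-tree reduction over the p partial sums.
    # Each level reads results written by other threads in the previous
    # level, so we count one barrier per level.
    stride = 1
    while stride < p:
        for t in range(0, p - stride, 2 * stride):
            partial[t] += partial[t + stride]
        barriers += 1
        stride *= 2

    return partial[0], barriers


if __name__ == "__main__":
    for n in (100, 100_000):
        total, barriers = two_phase_sum(list(range(1, n + 1)), w=4, l=2)
        print(n, total, barriers)
```

In this sketch only phase 1 does work proportional to n, and it needs no synchronization; all barriers come from phase 2, whose depth depends only on the thread count wl. Running it with w = 4 and l = 2 reports the same barrier count for 100 inputs as for 100,000.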
