{"title":"The super warp architecture with random address shift","authors":"K. Nakano, Susumu Matsumae","doi":"10.1109/HiPC.2013.6799118","DOIUrl":null,"url":null,"abstract":"The Discrete Memory Machine (DMM) is a theoretical parallel computing model that captures the essence of memory access by a streaming multiprocessor on CUDA-enabled GPUs. The DMM has w memory banks that constitute a shared memory, and each warp of w threads access the shared memory at the same time. However, memory access requests destined for the same memory bank are processed sequentially. Hence, it is very important for developing efficient algorithms to reduce the memory access congestion, the maximum number of memory access requests destined for the same bank. However, it is not easy to minimize the memory access congestion for some problems. The main contribution of this paper is to present novel and practical parallel computing models in which the congestion is small for any memory access requests. We first present the Super Discrete Memory Machine (SDMM), an extended version of the DMM, which supports a super warp with multiple warps. Memory access requests by multiple warps in a super warp are packed through pipeline registers to reduce the memory access congestion. We then go on to apply the random address shift technique to the SDMM. The resulting machine, the Random Super Discrete Memory Machine (RSDMM) can equalize memory access requests by a super warp. Quite surprisingly, for any memory access requests by a super warp on the RSDMM, the overhead of the memory access congestion is within a constant factor of perfectly scheduled memory access. Thus, unlike the DMM, developers of parallel algorithms do not have to consider the memory access congestion on the RSDMM. The congestion on the RSDMM is evaluated by theoretical analysis as well as by experiments.","PeriodicalId":206307,"journal":{"name":"20th Annual International Conference on High Performance Computing","volume":"283 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"20th Annual International Conference on High Performance Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/HiPC.2013.6799118","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Cited by: 5
Abstract
The Discrete Memory Machine (DMM) is a theoretical parallel computing model that captures the essence of memory access by a streaming multiprocessor on CUDA-enabled GPUs. The DMM has w memory banks that constitute a shared memory, and each warp of w threads accesses the shared memory at the same time. However, memory access requests destined for the same memory bank are processed sequentially. Hence, to develop efficient algorithms it is very important to reduce the memory access congestion, that is, the maximum number of memory access requests destined for the same bank. However, for some problems it is not easy to minimize the memory access congestion. The main contribution of this paper is to present novel and practical parallel computing models in which the congestion is small for any memory access requests. We first present the Super Discrete Memory Machine (SDMM), an extended version of the DMM that supports a super warp consisting of multiple warps. Memory access requests by the multiple warps in a super warp are packed through pipeline registers to reduce the memory access congestion. We then apply the random address shift technique to the SDMM. The resulting machine, the Random Super Discrete Memory Machine (RSDMM), can equalize the memory access requests of a super warp. Quite surprisingly, for any memory access requests by a super warp on the RSDMM, the overhead due to memory access congestion is within a constant factor of perfectly scheduled memory access. Thus, unlike on the DMM, developers of parallel algorithms do not have to consider the memory access congestion on the RSDMM. The congestion on the RSDMM is evaluated by theoretical analysis as well as by experiments.
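
The sketch below is not from the paper; it is a minimal illustration of the congestion metric the abstract refers to, assuming the common bank mapping bank(a) = a mod w and a per-row random shift as one possible reading of the random address shift idea. The bank width W, the shift table, and the column access pattern are all illustrative assumptions.

```python
# Minimal sketch (assumptions, not the paper's implementation) of the
# memory access congestion metric on a DMM-style shared memory with W banks,
# and of how a per-row random address shift can spread a worst-case access
# pattern across banks.
import random
from collections import Counter

W = 32  # assumed number of memory banks = threads per warp


def congestion(addresses, bank):
    """Maximum number of requests destined for the same bank."""
    counts = Counter(bank(a) for a in addresses)
    return max(counts.values())


# Plain mapping: address a goes to bank a mod W.
plain_bank = lambda a: a % W

# Random address shift (illustrative): every row of W consecutive addresses
# gets its own random offset, so bank(a) = (a + shift[a // W]) mod W.
shifts = [random.randrange(W) for _ in range(1024)]
shifted_bank = lambda a: (a + shifts[a // W]) % W

# A column access (stride W) hits a single bank under the plain mapping,
# giving the worst-case congestion W; with random shifts the requests are
# spread over the banks, so the congestion is typically much smaller.
column = [i * W for i in range(W)]
print("plain congestion:  ", congestion(column, plain_bank))    # W
print("shifted congestion:", congestion(column, shifted_bank))  # usually small
```

Running the sketch shows the gap the paper targets: the stride-W pattern costs W sequential bank accesses under the plain mapping, while the randomized mapping behaves like throwing W balls into W bins, so the maximum load is far below W with high probability.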