{"title":"Design space exploration of the turbo decoding algorithm on GPUs","authors":"Dongwon Lee, M. Wolf, Hyesoon Kim","doi":"10.1145/1878921.1878953","DOIUrl":null,"url":null,"abstract":"In this paper, we explore the design space of the Turbo decoding algorithm on GPUs and find a performance bottleneck. We consider three axes for the design space exploration: a radix degree, a parallelization method, and the number of sub-frames per thread block. In Turbo decoding, a degree of radix affects computational complexity and memory access patterns in both algorithmic and implementation viewpoints. Second, computations of branch metrics (BMs) and state metrics (SMs) have a different degree of parallelism, which affects the mapping method of computational tasks to GPU threads. Finally, we can easily adjust the number of sub-frames per thread block to balance the occupancy and memory access traffic. Experimental results show that the radix-4 algorithm with the SM-centric mapping method shows the best performance at four sub-frames per thread block. According to our analysis, two factors -- the occupancy and shared memory bank conflicts -- differentiate the performance of different cases in the design space. We show further performance improvements by optimizing a kernel operation (max*) and applying the MAX-Log-Maximum A Posteriori (MAP) algorithm. 
A performance bottleneck at the finally optimized case is global memory access latency.\n Since the most optimized performance is comparable to that of the other programmable platforms, the GPU can be considered as another type of coprocessor for Turbo decoding implementations in mobile devices.","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"64 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2010-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"22","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/1878921.1878953","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 22
Abstract
In this paper, we explore the design space of the Turbo decoding algorithm on GPUs and identify its performance bottleneck. We consider three axes for the design space exploration: the radix degree, the parallelization method, and the number of sub-frames per thread block. In Turbo decoding, the radix degree affects computational complexity and memory access patterns from both algorithmic and implementation viewpoints. Second, the computations of branch metrics (BMs) and state metrics (SMs) have different degrees of parallelism, which affects how computational tasks are mapped to GPU threads. Finally, we can easily adjust the number of sub-frames per thread block to balance occupancy and memory access traffic. Experimental results show that the radix-4 algorithm with the SM-centric mapping method performs best at four sub-frames per thread block. According to our analysis, two factors, occupancy and shared memory bank conflicts, differentiate the performance of the cases in the design space. We show further performance improvements by optimizing a kernel operation (max*) and applying the Max-Log-Maximum A Posteriori (MAP) algorithm. The performance bottleneck in the final optimized case is global memory access latency.
Since the optimized performance is comparable to that of other programmable platforms, the GPU can be considered another type of coprocessor for Turbo decoding implementations in mobile devices.