{"title":"Three-layer optimizations for fast GMM computations on GPU-like parallel processors","authors":"Kshitij Gupta, John Douglas Owens","doi":"10.1109/ASRU.2009.5373410","DOIUrl":null,"url":null,"abstract":"In this paper we focus on optimizing compute and memory-bandwidth-intensive GMM computations for low-end, small-form-factor devices running on GPU-like parallel processors. With special emphasis on tackling the memory bandwidth issue that is exacerbated by a lack of CPU-like caches providing temporal locality on GPU-like parallel processors, we propose modifications to three well-known GMM computation reduction techniques. We find considerable locality at the frame, CI-GMM, and mixture layers of GMM compute, and show how it can be extracted by following a chunk-based technique of processing multiple frames for every load of a GMM. On a 1,000- word, command-and-control, continuous-speech task, we are able to achieve compute and memory bandwidth savings of over 60% and 90% respectively, with some degradation in accuracy, when compared to existing GPU-based fast GMM computation techniques.","PeriodicalId":292194,"journal":{"name":"2009 IEEE Workshop on Automatic Speech Recognition & Understanding","volume":"49 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2009-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"21","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2009 IEEE Workshop on Automatic Speech Recognition & Understanding","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ASRU.2009.5373410","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 21
Abstract
In this paper we focus on optimizing compute- and memory-bandwidth-intensive GMM computations for low-end, small-form-factor devices running on GPU-like parallel processors. With special emphasis on tackling the memory bandwidth issue, which is exacerbated on GPU-like parallel processors by the lack of CPU-like caches providing temporal locality, we propose modifications to three well-known GMM computation reduction techniques. We find considerable locality at the frame, CI-GMM, and mixture layers of GMM computation, and show how it can be extracted with a chunk-based technique that processes multiple frames for every load of a GMM. On a 1,000-word, command-and-control, continuous-speech task, we achieve compute and memory bandwidth savings of over 60% and 90% respectively, with some degradation in accuracy, compared to existing GPU-based fast GMM computation techniques.
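The chunk-based idea in the abstract, reusing each load of a GMM's parameters across several frames, can be illustrated with a minimal NumPy sketch. This is not the paper's GPU implementation; the function name, the diagonal-covariance assumption, and the chunk size are all illustrative.

```python
import numpy as np

def gmm_loglik_chunked(frames, weights, means, inv_vars, chunk=8):
    """Evaluate diagonal-covariance GMM log-likelihoods for all frames,
    processing `chunk` frames per pass so each load of the mixture
    parameters is amortized over many frames (temporal locality)."""
    T, D = frames.shape           # T frames of D-dimensional features
    # Per-component log normalizer: log w_m - 0.5*(D*log(2*pi) + sum log var_m)
    log_norm = (np.log(weights)
                - 0.5 * (D * np.log(2 * np.pi) - np.log(inv_vars).sum(axis=1)))
    out = np.empty(T)
    for start in range(0, T, chunk):
        x = frames[start:start + chunk]                       # (c, D) chunk
        # Squared Mahalanobis distance of each frame to every component: (c, M)
        d2 = ((x[:, None, :] - means[None, :, :]) ** 2
              * inv_vars[None, :, :]).sum(axis=-1)
        comp = log_norm[None, :] - 0.5 * d2                   # per-component log-liks
        # Log-sum-exp over mixture components gives the GMM log-likelihood
        mx = comp.max(axis=1, keepdims=True)
        out[start:start + chunk] = (
            mx + np.log(np.exp(comp - mx).sum(axis=1, keepdims=True))
        ).ravel()
    return out
```

In a cacheless streaming setting, the inner pass would stage the chunk of frames in fast local memory while the GMM parameters stream through once per chunk rather than once per frame, which is the source of the bandwidth savings the paper reports.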