Three-layer optimizations for fast GMM computations on GPU-like parallel processors

2009 IEEE Workshop on Automatic Speech Recognition & Understanding. Pub Date: 2009-12-01. DOI: 10.1109/ASRU.2009.5373410

Kshitij Gupta, John Douglas Owens


Citations: 21

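Abstract

In this paper we focus on optimizing compute- and memory-bandwidth-intensive Gaussian mixture model (GMM) computations for low-end, small-form-factor devices running on GPU-like parallel processors. With special emphasis on tackling the memory bandwidth issue, which is exacerbated by the lack of CPU-like caches providing temporal locality on GPU-like parallel processors, we propose modifications to three well-known GMM computation reduction techniques. We find considerable locality at the frame, CI-GMM, and mixture layers of GMM compute, and show how it can be extracted by following a chunk-based technique of processing multiple frames for every load of a GMM. On a 1,000-word, command-and-control, continuous-speech task, we are able to achieve compute and memory bandwidth savings of over 60% and 90% respectively, with some degradation in accuracy, when compared to existing GPU-based fast GMM computation techniques.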
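The chunk-based idea mentioned in the abstract, processing multiple frames for every load of a GMM, can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes diagonal-covariance Gaussians, models a "load" as a simple counter, and leaves out the three computation reduction techniques and any accuracy trade-off. The names `log_gaussian`, `gmm_log_likelihood`, `score_chunked`, and `chunk_size` are hypothetical and chosen only for illustration.

```python
import math

def log_gaussian(frame, mean, var):
    """Log-density of one diagonal-covariance Gaussian at one frame."""
    acc = 0.0
    for x, m, v in zip(frame, mean, var):
        acc += math.log(2.0 * math.pi * v) + (x - m) ** 2 / v
    return -0.5 * acc

def gmm_log_likelihood(frame, gmm):
    """Log-likelihood of a frame under a GMM given as (weight, mean, var) tuples."""
    terms = [math.log(w) + log_gaussian(frame, m, v) for w, m, v in gmm]
    peak = max(terms)  # log-sum-exp with the largest term factored out
    return peak + math.log(sum(math.exp(t - peak) for t in terms))

def score_chunked(frames, gmms, chunk_size):
    """Score every frame against every GMM, fetching each GMM once per chunk.

    Returns (scores, loads): scores[t][g] is the log-likelihood of frame t
    under GMM g, and loads counts the simulated GMM parameter fetches.
    """
    scores = [[0.0] * len(gmms) for _ in frames]
    loads = 0
    for start in range(0, len(frames), chunk_size):
        chunk = range(start, min(start + chunk_size, len(frames)))
        for g, gmm in enumerate(gmms):
            loads += 1  # one simulated fetch of this GMM's parameters
            for t in chunk:  # reuse the fetched parameters for the whole chunk
                scores[t][g] = gmm_log_likelihood(frames[t], gmm)
    return scores, loads
```

In this toy model, `chunk_size=1` corresponds to a frame-at-a-time baseline, and a larger chunk divides the number of simulated GMM fetches by roughly the chunk size while leaving the scores unchanged. The savings reported in the abstract come from the paper's own techniques and measurements, not from this sketch.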



