Compute & memory optimizations for high-quality speech recognition on low-end GPU processors

Kshitij Gupta, John Douglas Owens
{"title":"在低端GPU处理器上进行高质量语音识别的计算和内存优化","authors":"Kshitij Gupta, John Douglas Owens","doi":"10.1109/HiPC.2011.6152741","DOIUrl":null,"url":null,"abstract":"Gaussian Mixture Model (GMM) computations in modern Automatic Speech Recognition systems are known to dominate the total processing time, and are both memory bandwidth and compute intensive. Graphics processors (GPU), are well suited for applications exhibiting data- and thread-level parallelism, as that exhibited by GMM score computations. By exploiting temporal locality over successive frames of speech, we have previously presented a theoretical framework for modifying the traditional speech processing pipeline and obtaining significant savings in compute and memory bandwidth requirements, especially on resource-constrained devices like those found in mobile devices. In this paper we discuss in detail our implementation for two of the three techniques we previously proposed, and suggest a set of guidelines of which technique is suitable for a given condition. For a medium-vocabulary, dictation task consisting of 5k words, we are able to reduce memory bandwidth by 80% for a 20% overhead in compute without loss in accuracy by applying the first technique, and memory and compute savings of 90% and 35% respectively for a 15% degradation in accuracy by using the second technique. We are able to achieve a 4× speed-up (to 6 tim es real-time performance), over the baseline on a low-end 9400M Nvidia GPU.","PeriodicalId":122468,"journal":{"name":"2011 18th International Conference on High Performance Computing","volume":"13 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2011-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"12","resultStr":"{\"title\":\"Compute & memory optimizations for high-quality speech recognition on low-end GPU processors\",\"authors\":\"Kshitij Gupta, John Douglas Owens\",\"doi\":\"10.1109/HiPC.2011.6152741\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Gaussian Mixture Model (GMM) computations in modern Automatic Speech Recognition systems are known to dominate the total processing time, and are both memory bandwidth and compute intensive. Graphics processors (GPU), are well suited for applications exhibiting data- and thread-level parallelism, as that exhibited by GMM score computations. By exploiting temporal locality over successive frames of speech, we have previously presented a theoretical framework for modifying the traditional speech processing pipeline and obtaining significant savings in compute and memory bandwidth requirements, especially on resource-constrained devices like those found in mobile devices. In this paper we discuss in detail our implementation for two of the three techniques we previously proposed, and suggest a set of guidelines of which technique is suitable for a given condition. For a medium-vocabulary, dictation task consisting of 5k words, we are able to reduce memory bandwidth by 80% for a 20% overhead in compute without loss in accuracy by applying the first technique, and memory and compute savings of 90% and 35% respectively for a 15% degradation in accuracy by using the second technique. 
We are able to achieve a 4× speed-up (to 6 tim es real-time performance), over the baseline on a low-end 9400M Nvidia GPU.\",\"PeriodicalId\":122468,\"journal\":{\"name\":\"2011 18th International Conference on High Performance Computing\",\"volume\":\"13 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2011-12-18\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"12\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2011 18th International Conference on High Performance Computing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/HiPC.2011.6152741\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2011 18th International Conference on High Performance Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/HiPC.2011.6152741","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 12

Abstract

Gaussian Mixture Model (GMM) computations in modern Automatic Speech Recognition systems are known to dominate the total processing time, and are both memory-bandwidth and compute intensive. Graphics processors (GPUs) are well suited to applications exhibiting data- and thread-level parallelism, such as GMM score computation. By exploiting temporal locality across successive frames of speech, we previously presented a theoretical framework for modifying the traditional speech-processing pipeline to obtain significant savings in compute and memory-bandwidth requirements, especially on resource-constrained devices such as those found in mobile platforms. In this paper we discuss in detail our implementation of two of the three techniques we previously proposed, and suggest guidelines for which technique suits a given condition. For a medium-vocabulary dictation task over a 5k-word vocabulary, the first technique reduces memory bandwidth by 80% for a 20% compute overhead with no loss in accuracy, and the second yields memory and compute savings of 90% and 35%, respectively, for a 15% degradation in accuracy. We achieve a 4× speed-up over the baseline (to 6 times real-time performance) on a low-end Nvidia 9400M GPU.
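For context on the optimization the abstract describes: with diagonal-covariance Gaussians, a component's score for a frame is a weighted log-likelihood, and the bandwidth savings come from loading each Gaussian's mean and variance vectors once and reusing them across a batch of successive frames. The CUDA sketch below illustrates that frame-batching pattern; it is a minimal illustration under assumed parameters (39-dimensional features, a batch of 8 frames, placeholder data), not the authors' implementation.

```cuda
// Minimal sketch of frame-batched GMM scoring (diagonal covariance).
// Assumptions, not taken from the paper: DIM = 39 MFCC-style features,
// FRAMES_PER_BATCH = 8 successive frames reuse one parameter load.
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int DIM = 39;
constexpr int FRAMES_PER_BATCH = 8;

// scores[g*numFrames + f] = logConst[g] - 0.5 * sum_d (x[f][d]-mu[g][d])^2 / var[g][d],
// with the mixture weight and Gaussian normalizer folded into logConst[g].
__global__ void gmmScoreBatched(const float* __restrict__ frames,   // [numFrames][DIM]
                                const float* __restrict__ means,    // [numGauss][DIM]
                                const float* __restrict__ invVars,  // [numGauss][DIM]
                                const float* __restrict__ logConst, // [numGauss]
                                float* scores, int numFrames, int numGauss)
{
    int g = blockIdx.x;  // one Gaussian per block
    if (g >= numGauss) return;

    // Fetch this Gaussian's parameters from DRAM once into shared memory...
    __shared__ float mu[DIM], iv[DIM];
    for (int d = threadIdx.x; d < DIM; d += blockDim.x) {
        mu[d] = means[g * DIM + d];
        iv[d] = invVars[g * DIM + d];
    }
    __syncthreads();

    // ...then reuse them across a batch of successive frames (temporal locality),
    // so parameter traffic is paid once per batch instead of once per frame.
    int f0 = blockIdx.y * FRAMES_PER_BATCH;
    int fEnd = min(f0 + FRAMES_PER_BATCH, numFrames);
    for (int f = f0 + threadIdx.x; f < fEnd; f += blockDim.x) {
        float acc = logConst[g];
        for (int d = 0; d < DIM; ++d) {
            float diff = frames[f * DIM + d] - mu[d];
            acc -= 0.5f * diff * diff * iv[d];
        }
        scores[g * numFrames + f] = acc;
    }
}

int main() {
    const int numFrames = 64, numGauss = 256;
    // Placeholder data: frames of 0.1, zero means, unit variances, zero constants.
    std::vector<float> hFrames(numFrames * DIM, 0.1f);
    std::vector<float> hMeans(numGauss * DIM, 0.0f);
    std::vector<float> hInvVars(numGauss * DIM, 1.0f);
    std::vector<float> hLogConst(numGauss, 0.0f);

    float *dFrames, *dMeans, *dInvVars, *dLogConst, *dScores;
    cudaMalloc(&dFrames, hFrames.size() * sizeof(float));
    cudaMalloc(&dMeans, hMeans.size() * sizeof(float));
    cudaMalloc(&dInvVars, hInvVars.size() * sizeof(float));
    cudaMalloc(&dLogConst, hLogConst.size() * sizeof(float));
    cudaMalloc(&dScores, (size_t)numGauss * numFrames * sizeof(float));
    cudaMemcpy(dFrames, hFrames.data(), hFrames.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dMeans, hMeans.data(), hMeans.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dInvVars, hInvVars.data(), hInvVars.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dLogConst, hLogConst.data(), hLogConst.size() * sizeof(float), cudaMemcpyHostToDevice);

    // One block per (Gaussian, frame batch) pair.
    dim3 grid(numGauss, (numFrames + FRAMES_PER_BATCH - 1) / FRAMES_PER_BATCH);
    gmmScoreBatched<<<grid, 64>>>(dFrames, dMeans, dInvVars, dLogConst, dScores,
                                  numFrames, numGauss);
    cudaDeviceSynchronize();

    float s0;
    cudaMemcpy(&s0, dScores, sizeof(float), cudaMemcpyDeviceToHost);
    printf("score[g=0,f=0] = %f\n", s0);  // expect -0.5 * 39 * 0.01 = -0.195
    return 0;
}
```

In a full decoder, the per-component scores for a state would then be combined with a log-sum-exp reduction. The design point the sketch isolates is that parameter traffic scales with the number of frame batches rather than the number of frames; all sizes, names, and data above are illustrative.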