语音自动识别的性能分析与优化

IEEE Transactions on Multi-Scale Computing Systems Pub Date : 2017-08-14 DOI:10.1109/TMSCS.2017.2739158

Hamid Tabani;Jose-Maria Arnau;Jordi Tubella;Antonio González

{"title":"语音自动识别的性能分析与优化","authors":"Hamid Tabani;Jose-Maria Arnau;Jordi Tubella;Antonio González","doi":"10.1109/TMSCS.2017.2739158","DOIUrl":null,"url":null,"abstract":"Fast and accurate Automatic Speech Recognition (ASR) is emerging as a key application for mobile devices. Delivering ASR on such devices is challenging due to the compute-intensive nature of the problem and the power constraints of embedded systems. In this paper, we provide a performance and energy characterization of Pocketsphinx, a popular toolset for ASR that targets mobile devices. We identify the computation of the Gaussian Mixture Model (GMM) as the main bottleneck, consuming more than 80 percent of the execution time. The CPI stack analysis shows that branches and main memory accesses are the main performance limiting factors for GMM computation. We propose several software-level optimizations driven by the power/performance analysis. Unlike previous proposals that trade accuracy for performance by reducing the number of Gaussians evaluated, we maintain accuracy and improve performance by effectively using the underlying CPU microarchitecture. First, we use a refactored implementation of the innermost loop of the GMM evaluation code to ameliorate the impact of branches. Second, we exploit the vector unit available on most modern CPUs to boost GMM computation, introducing a novel memory layout for storing the means and variances of the Gaussians in order to maximize the effectiveness of vectorization. Third, we compute the Gaussians for multiple frames in parallel, so means and variances can be fetched once in the on-chip caches and reused across multiple frames, significantly reducing memory bandwidth usage. We evaluate our optimizations using both hardware counters on real CPUs and simulations. Our experimental results show that the proposed optimizations provide 2.68x speedup over the baseline Pocketsphinx decoder on a high-end Intel Skylake CPU, while achieving 61 percent energy savings. On a modern ARM Cortex-A57 mobile processor our techniques improve performance by 1.85x, while providing 59 percent energy savings without any loss in the accuracy of the ASR system.","PeriodicalId":100643,"journal":{"name":"IEEE Transactions on Multi-Scale Computing Systems","volume":"4 4","pages":"847-860"},"PeriodicalIF":0.0000,"publicationDate":"2017-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TMSCS.2017.2739158","citationCount":"11","resultStr":"{\"title\":\"Performance Analysis and Optimization of Automatic Speech Recognition\",\"authors\":\"Hamid Tabani;Jose-Maria Arnau;Jordi Tubella;Antonio González\",\"doi\":\"10.1109/TMSCS.2017.2739158\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Fast and accurate Automatic Speech Recognition (ASR) is emerging as a key application for mobile devices. Delivering ASR on such devices is challenging due to the compute-intensive nature of the problem and the power constraints of embedded systems. In this paper, we provide a performance and energy characterization of Pocketsphinx, a popular toolset for ASR that targets mobile devices. We identify the computation of the Gaussian Mixture Model (GMM) as the main bottleneck, consuming more than 80 percent of the execution time. The CPI stack analysis shows that branches and main memory accesses are the main performance limiting factors for GMM computation. We propose several software-level optimizations driven by the power/performance analysis. Unlike previous proposals that trade accuracy for performance by reducing the number of Gaussians evaluated, we maintain accuracy and improve performance by effectively using the underlying CPU microarchitecture. First, we use a refactored implementation of the innermost loop of the GMM evaluation code to ameliorate the impact of branches. Second, we exploit the vector unit available on most modern CPUs to boost GMM computation, introducing a novel memory layout for storing the means and variances of the Gaussians in order to maximize the effectiveness of vectorization. Third, we compute the Gaussians for multiple frames in parallel, so means and variances can be fetched once in the on-chip caches and reused across multiple frames, significantly reducing memory bandwidth usage. We evaluate our optimizations using both hardware counters on real CPUs and simulations. Our experimental results show that the proposed optimizations provide 2.68x speedup over the baseline Pocketsphinx decoder on a high-end Intel Skylake CPU, while achieving 61 percent energy savings. On a modern ARM Cortex-A57 mobile processor our techniques improve performance by 1.85x, while providing 59 percent energy savings without any loss in the accuracy of the ASR system.\",\"PeriodicalId\":100643,\"journal\":{\"name\":\"IEEE Transactions on Multi-Scale Computing Systems\",\"volume\":\"4 4\",\"pages\":\"847-860\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2017-08-14\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://sci-hub-pdf.com/10.1109/TMSCS.2017.2739158\",\"citationCount\":\"11\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Multi-Scale Computing Systems\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/8010340/\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Multi-Scale Computing Systems","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/8010340/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 11

摘要

快速准确的自动语音识别（ASR）正在成为移动设备的关键应用。由于问题的计算密集性和嵌入式系统的功率限制，在此类设备上提供ASR具有挑战性。在本文中，我们提供了Pocketsphinx的性能和能量表征，这是一个针对移动设备的ASR的流行工具集。我们将高斯混合模型（GMM）的计算确定为主要瓶颈，消耗了80%以上的执行时间。CPI堆栈分析表明，分支和主存访问是限制GMM计算性能的主要因素。我们提出了一些由功率/性能分析驱动的软件级优化。与以前通过减少评估的高斯数来以准确性换取性能的建议不同，我们通过有效使用底层CPU微体系结构来保持准确性并提高性能。首先，我们使用GMM评估代码最内部循环的重构实现来改善分支的影响。其次，我们利用大多数现代CPU上可用的矢量单元来提高GMM计算，引入了一种新的存储器布局来存储高斯的均值和方差，以最大限度地提高矢量化的有效性。第三，我们并行计算多个帧的高斯，因此可以在片上缓存中提取一次均值和方差，并在多个帧之间重用，从而显著减少内存带宽使用。我们使用真实CPU上的硬件计数器和模拟来评估我们的优化。我们的实验结果表明，在高端Intel Skylake CPU上，所提出的优化比基线Pocketsphinx解码器提供了2.68倍的加速，同时实现了61%的节能。在现代ARM Cortex-A57移动处理器上，我们的技术将性能提高了1.85倍，同时在不损失ASR系统精度的情况下节省了59%的能源。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Performance Analysis and Optimization of Automatic Speech Recognition

Fast and accurate Automatic Speech Recognition (ASR) is emerging as a key application for mobile devices. Delivering ASR on such devices is challenging due to the compute-intensive nature of the problem and the power constraints of embedded systems. In this paper, we provide a performance and energy characterization of Pocketsphinx, a popular toolset for ASR that targets mobile devices. We identify the computation of the Gaussian Mixture Model (GMM) as the main bottleneck, consuming more than 80 percent of the execution time. The CPI stack analysis shows that branches and main memory accesses are the main performance limiting factors for GMM computation. We propose several software-level optimizations driven by the power/performance analysis. Unlike previous proposals that trade accuracy for performance by reducing the number of Gaussians evaluated, we maintain accuracy and improve performance by effectively using the underlying CPU microarchitecture. First, we use a refactored implementation of the innermost loop of the GMM evaluation code to ameliorate the impact of branches. Second, we exploit the vector unit available on most modern CPUs to boost GMM computation, introducing a novel memory layout for storing the means and variances of the Gaussians in order to maximize the effectiveness of vectorization. Third, we compute the Gaussians for multiple frames in parallel, so means and variances can be fetched once in the on-chip caches and reused across multiple frames, significantly reducing memory bandwidth usage. We evaluate our optimizations using both hardware counters on real CPUs and simulations. Our experimental results show that the proposed optimizations provide 2.68x speedup over the baseline Pocketsphinx decoder on a high-end Intel Skylake CPU, while achieving 61 percent energy savings. On a modern ARM Cortex-A57 mobile processor our techniques improve performance by 1.85x, while providing 59 percent energy savings without any loss in the accuracy of the ASR system.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

IEEE Transactions on Multi-Scale Computing Systems

自引率

0.00%

发文量