{"title":"Parallelized feature extraction and acoustic model training","authors":"Haofeng Kou, Weijia Shang","doi":"10.1109/ICDSP.2014.6900717","DOIUrl":null,"url":null,"abstract":"In this paper, we present our research on the parallelized speech recognition including both Mel-Frequency Cepstral Coefficient (MFCC) feature extraction [1] and Viterbi training for Hidden Markov Model (HMM) based acoustic model [2] on the Graphics Processing Units (GPU). Robust and accurate speech recognition systems can only be realized with adequately trained acoustic models derived from the effectively parsed features. For common languages, state-of-the-art systems are extracted and trained on many thousands of hours of speech data and even with large clusters of machines the entire extracting and training process can take weeks. To overcome this development bottleneck, we not only demonstrate that feature extraction and acoustic model training are suitable for GPUs, but also propose the optimized parallel implementation using highly parallel GPUs by combining the MFCC feature extraction along with Viterbi training for HMM acoustic model, illustrate its application concurrency characteristics, data working set sizes, and describe the optimizations required for effective throughput on GPU processors. We demonstrate that feature extraction and acoustic model training are well suited for GPUs. Using one GTX580 our approach is shown to be overall approximately 95x faster than a sequential CPU implementation at the same accuracy level, enabling feature extraction and acoustic model training to be performed at realtime.","PeriodicalId":301856,"journal":{"name":"2014 19th International Conference on Digital Signal Processing","volume":"87 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2014-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2014 19th International Conference on Digital Signal Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICDSP.2014.6900717","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 5
Abstract
In this paper, we present our research on parallelized speech recognition, covering both Mel-Frequency Cepstral Coefficient (MFCC) feature extraction [1] and Viterbi training of Hidden Markov Model (HMM) based acoustic models [2] on graphics processing units (GPUs). Robust and accurate speech recognition systems can only be realized with adequately trained acoustic models derived from effectively extracted features. For common languages, state-of-the-art systems are trained on many thousands of hours of speech data, and even with large clusters of machines the entire extraction and training process can take weeks. To overcome this development bottleneck, we demonstrate that feature extraction and acoustic model training are well suited to GPUs, and we propose an optimized parallel implementation that combines MFCC feature extraction with Viterbi training of the HMM acoustic model on highly parallel GPUs. We illustrate the application's concurrency characteristics and data working-set sizes, and describe the optimizations required for effective throughput on GPU processors. Using a single GTX 580, our approach is approximately 95x faster overall than a sequential CPU implementation at the same accuracy level, enabling feature extraction and acoustic model training to be performed in real time.
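For orientation, below is a minimal sequential NumPy sketch of the standard MFCC pipeline that the paper parallelizes. The parameter defaults and helper names here are illustrative assumptions, not taken from the paper or its GPU implementation. The property that makes this workload GPU-friendly is visible in the code: after the signal is sliced into overlapping frames, every frame flows through the window, FFT, mel filterbank, log, and DCT stages independently of all other frames.

```python
# Sequential reference sketch of MFCC extraction (illustrative parameters,
# not the authors' implementation). Frames are mutually independent, which
# is the data parallelism a GPU version exploits by batching frames.
import numpy as np

def mfcc(signal, sample_rate=16000, frame_len=400, hop=160,
         n_fft=512, n_mels=26, n_ceps=13):
    # Pre-emphasis boosts high frequencies.
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])

    # Slice into overlapping frames and apply a Hamming window.
    # Each row is one frame; rows can be processed concurrently.
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = emphasized[idx] * np.hamming(frame_len)

    # Per-frame power spectrum via the FFT.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft

    # Triangular mel filterbank (illustrative construction).
    def hz_to_mel(hz):
        return 2595.0 * np.log10(1.0 + hz / 700.0)
    def mel_to_hz(mel):
        return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)
    mel_pts = np.linspace(0.0, hz_to_mel(sample_rate / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sample_rate).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)

    # Log filterbank energies, then a DCT-II to decorrelate them.
    feat = np.log(power @ fbank.T + 1e-10)
    k = np.arange(n_ceps)[:, None]
    n = np.arange(n_mels)
    dct = np.cos(np.pi * k * (2 * n + 1) / (2 * n_mels))
    return feat @ dct.T  # shape: (n_frames, n_ceps)

# Usage: 13 cepstral coefficients per 25 ms frame of a 1 s signal.
ceps = mfcc(np.random.randn(16000))
```

Because the per-frame stages share no state, a GPU implementation along the lines the paper describes would typically map frames to thread blocks and batch the FFTs and matrix products, which is where the reported speedup over a sequential CPU implementation comes from.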