Convolutional Neural Networks for Raw Speech Recognition

Vishal Passricha, R. Aggarwal
Published in: From Natural to Artificial Intelligence - Algorithms and Applications
Publication date: 2018-12-12
DOI: 10.5772/INTECHOPEN.80026 (https://doi.org/10.5772/INTECHOPEN.80026)
Citations: 22

Abstract

State-of-the-art automatic speech recognition (ASR) systems map a speech signal to its corresponding text. Traditional ASR systems are based on Gaussian mixture models. The emergence of deep learning drastically improved the recognition rate of ASR systems, and deep-learning-based systems are replacing the traditional ones. These systems can also be trained in an end-to-end manner. End-to-end ASR systems are gaining popularity because they simplify the model-building process and can map speech directly to text without any predefined alignments. The three major types of end-to-end architectures for ASR are attention-based methods, connectionist temporal classification, and the convolutional neural network (CNN)-based direct raw speech model. This chapter discusses a CNN-based acoustic model for raw speech signals. The model establishes the relation between the raw speech signal and phones in a data-driven manner: the relevant features and the classifier are learned jointly from the raw speech. The first convolutional layer processes the raw speech to learn a feature representation. Its output, an intermediate representation, is more discriminative and is processed further by the remaining convolutional layers. The system uses only a few parameters and is reported to perform better than traditional cepstral feature-based systems. Its performance is evaluated on TIMIT, where it is claimed to perform similarly to MFCC-based systems.
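The processing chain described in the abstract (raw waveform, first convolutional layer, further convolutional layers, classifier over phones) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the layer sizes, kernel widths, strides, pooling sizes, the time-averaging step, and the 39 output classes are all assumptions chosen for the example, the weights are random rather than trained, and the input is synthetic noise rather than TIMIT audio.

```python
import numpy as np

def conv1d(x, kernels, stride):
    """Valid 1-D convolution. x: (in_ch, T); kernels: (out_ch, in_ch, width)."""
    out_ch, in_ch, width = kernels.shape
    n_frames = (x.shape[1] - width) // stride + 1
    out = np.empty((out_ch, n_frames))
    for t in range(n_frames):
        window = x[:, t * stride : t * stride + width]  # (in_ch, width)
        out[:, t] = np.tensordot(kernels, window, axes=([1, 2], [0, 1]))
    return out

def max_pool(x, size):
    """Non-overlapping max pooling along the time axis."""
    n = x.shape[1] // size
    return x[:, : n * size].reshape(x.shape[0], n, size).max(axis=2)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def phone_posteriors(wave, params):
    """Raw samples -> conv stack -> linear classifier -> class probabilities."""
    h = wave[np.newaxis, :]  # treat the waveform as a single input channel
    for kernels, stride, pool in params["conv"]:
        h = np.maximum(conv1d(h, kernels, stride), 0.0)  # convolution + ReLU
        h = max_pool(h, pool)
    feat = h.mean(axis=1)  # average over time (a simplification for this sketch)
    return softmax(params["W"] @ feat + params["b"])

# Hypothetical, untrained parameters; every size below is an assumption.
rng = np.random.default_rng(0)
params = {
    "conv": [
        (rng.normal(scale=0.1, size=(8, 1, 30)), 10, 2),  # first layer sees raw samples
        (rng.normal(scale=0.1, size=(16, 8, 5)), 1, 2),   # later layer refines the features
    ],
    "W": rng.normal(scale=0.1, size=(39, 16)),  # 39 output classes is an assumption
    "b": np.zeros(39),
}
wave = rng.normal(size=4000)  # synthetic stand-in for 250 ms of 16 kHz speech
post = phone_posteriors(wave, params)
print(post.shape, float(post.sum()))
```

In a trained system the convolution kernels and the classifier weights would be optimized jointly, which is what the abstract means by learning the features and the classifier together from raw speech.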