Investigating the Downstream Impact of Grapheme-Based Acoustic Modeling on Spoken Utterance Classification

Ryan Price, Bhargav Srinivas Ch, Surbhi Singhal, S. Bangalore

2018 IEEE Spoken Language Technology Workshop (SLT), December 2018
DOI: 10.1109/SLT.2018.8639549
Citations: 1
Abstract
Automatic speech recognition (ASR) and natural language understanding are critical components of spoken language understanding (SLU) systems. One obstacle to providing services with SLU systems in multiple languages is the cost associated with acquiring all of the language-specific resources required for ASR in each language. Modeling graphemes eliminates the need to obtain a pronunciation dictionary which maps from speech sounds to words and is one way to reduce ASR resource dependencies when rapidly developing ASR in new languages. However, little is known about the downstream impact on SLU task performance when selecting graphemes as the acoustic modeling unit. This work investigates acoustic modeling for the ASR component of an SLU system using grapheme-based approaches together with convolutional and recurrent neural network architectures. We evaluate both ASR word accuracy and spoken utterance classification (SUC) accuracy for English, Italian and Spanish language tasks and find that it is possible to achieve SUC accuracy that is comparable to conventional phoneme-based systems which leverage a pronunciation dictionary.
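The abstract's central contrast is between phoneme-based acoustic modeling, which needs a hand-built pronunciation dictionary mapping words to speech sounds, and grapheme-based modeling, which uses a word's letters directly. A minimal sketch of that difference (the phoneme entries below are illustrative ARPAbet-style examples, not data from the paper):

```python
# Phoneme-based modeling requires a curated pronunciation dictionary;
# the entries here are illustrative, not from the paper's resources.
phoneme_lexicon = {
    "cat": ["K", "AE", "T"],
    "ciao": ["CH", "AW"],
}

def phoneme_units(word):
    """Look up hand-crafted phoneme units; fails for out-of-lexicon words."""
    return phoneme_lexicon[word]

def grapheme_units(word):
    """Grapheme-based modeling: the units are simply the word's letters,
    so no pronunciation dictionary is needed for a new language."""
    return list(word)

print(phoneme_units("cat"))     # ['K', 'AE', 'T']
print(grapheme_units("cat"))    # ['c', 'a', 't']
print(grapheme_units("perro"))  # works for any word, no dictionary entry required
```

This is why grapheme modeling lowers the resource cost of porting ASR to new languages, at the risk of a weaker match between modeling units and actual pronunciation, which is the downstream trade-off the paper measures on SUC accuracy.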