Yu-Chih Deng, Yuan-Fu Liao, Yih-Ru Wang, Sin-Horng Chen
Speech Communication, vol. 154, Article 102983, October 2023. DOI: 10.1016/j.specom.2023.102983
Toward enriched decoding of Mandarin spontaneous speech
A deep neural network (DNN)-based automatic speech recognition (ASR) method for enriched decoding of Mandarin spontaneous speech is proposed. The method first builds a baseline system, composed of a factored time-delay neural network (TDNN-f) acoustic model (AM), a trigram language model (LM), and a recurrent neural network language model (RNNLM), to generate a word lattice. It then sequentially incorporates a multi-task Part-of-Speech RNNLM (POS-RNNLM), a hierarchical prosodic model (HPM), and a reduplication-word LM (RLM) into the decoding process by expanding the word lattice and rescoring it. This both improves recognition performance and enriches the decoding output with syntactic parameters of POS and punctuation marks (PM), prosodic tags of word-juncture break types and syllable prosodic states, and an edited recognition text with reduplication words eliminated. Experimental results on the Mandarin conversational dialogue corpus (MCDC) showed that SER, CER, and WER of 13.2%, 13.9%, and 19.1% were achieved when the POS-RNNLM and HPM were incorporated into the baseline system, representing relative SER, CER, and WER reductions of 7.7%, 7.9%, and 5.0% compared with the baseline system. Furthermore, adding the RLM yielded additional relative SER, CER, and WER reductions of 3%, 4.6%, and 4.5% by eliminating reduplication words.
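As a reading aid, the sketch below illustrates two of the simpler notions in the abstract: what a "relative" error-rate reduction means, and, in a deliberately simplified rule-based form, what eliminating reduplication words from spontaneous speech looks like. The function names, the example numbers, and the collapse-adjacent-repeats rule are illustrative assumptions only; in the paper the editing is driven by RLM-based lattice rescoring, not a surface rule.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Relative error-rate reduction: (baseline - improved) / baseline.
    E.g. an error rate falling from 20% to 19% is a 5% relative reduction."""
    return (baseline - improved) / baseline


def remove_reduplications(tokens: list[str]) -> list[str]:
    """Toy stand-in for reduplication-word elimination: collapse
    immediately repeated tokens, a common disfluency in spontaneous
    Mandarin (e.g. "我 我 想 想" -> "我 想")."""
    out: list[str] = []
    for tok in tokens:
        if not out or out[-1] != tok:
            out.append(tok)
    return out


print(relative_reduction(0.20, 0.19))                     # hypothetical rates
print(remove_reduplications("我 我 想 想 要 去 台北".split()))
```

Note that a real system must distinguish disfluent repetition from legitimate reduplication (e.g. 看看, 慢慢), which is why the paper models it with a dedicated LM inside the lattice rather than a string rule.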
Journal introduction:
Speech Communication is an interdisciplinary journal whose primary objective is to fulfil the need for the rapid dissemination and thorough discussion of basic and applied research results.
The journal's objectives are:
• to present a forum for the advancement of human and human-machine speech communication science;
• to stimulate cross-fertilization between different fields of this domain;
• to contribute towards the rapid and wide diffusion of scientifically sound contributions in this domain.