{"title":"直接使用词作为语音识别声学建模单元的探索","authors":"Chunlei Zhang, Chengzhu Yu, Chao Weng, Jia Cui, Dong Yu","doi":"10.1109/SLT.2018.8639623","DOIUrl":null,"url":null,"abstract":"Conventional acoustic models for automatic speech recognition (ASR) are usually constructed from sub-word unit (e.g., context-dependent phoneme, grapheme, wordpiece etc.). Recent studies demonstrate that connectionist temporal classification (CTC) based acoustic-to-word (A2W) models are also promising for ASR. Such structures have drawn increasing attention as they can directly target words as output units, which simplify ASR pipeline by avoiding additional pronunciation lexicon, or even language model. In this study, we systematically explore to use word as acoustic modeling unit for conversational speech recognition. By replacing senone alignment with word alignment in a convolutional bidirectional LSTM architecture and employing a lexicon-free weighted finite-state transducer (WFST) based decoding, we greatly simplify conventional hybrid speech recognition system. On Hub5-2000 Switchboard/CallHome test sets with 300-hour training data, we achieve a WER that is close to the senone based hybrid systems with a WFST based decoding.","PeriodicalId":377307,"journal":{"name":"2018 IEEE Spoken Language Technology Workshop (SLT)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":"{\"title\":\"An Exploration of Directly Using Word as ACOUSTIC Modeling Unit for Speech Recognition\",\"authors\":\"Chunlei Zhang, Chengzhu Yu, Chao Weng, Jia Cui, Dong Yu\",\"doi\":\"10.1109/SLT.2018.8639623\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Conventional acoustic models for automatic speech recognition (ASR) are usually constructed from sub-word unit (e.g., context-dependent phoneme, grapheme, wordpiece etc.). Recent studies demonstrate that connectionist temporal classification (CTC) based acoustic-to-word (A2W) models are also promising for ASR. Such structures have drawn increasing attention as they can directly target words as output units, which simplify ASR pipeline by avoiding additional pronunciation lexicon, or even language model. In this study, we systematically explore to use word as acoustic modeling unit for conversational speech recognition. By replacing senone alignment with word alignment in a convolutional bidirectional LSTM architecture and employing a lexicon-free weighted finite-state transducer (WFST) based decoding, we greatly simplify conventional hybrid speech recognition system. 
On Hub5-2000 Switchboard/CallHome test sets with 300-hour training data, we achieve a WER that is close to the senone based hybrid systems with a WFST based decoding.\",\"PeriodicalId\":377307,\"journal\":{\"name\":\"2018 IEEE Spoken Language Technology Workshop (SLT)\",\"volume\":\"20 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2018-12-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"6\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2018 IEEE Spoken Language Technology Workshop (SLT)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/SLT.2018.8639623\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2018 IEEE Spoken Language Technology Workshop (SLT)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SLT.2018.8639623","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
An Exploration of Directly Using Word as ACOUSTIC Modeling Unit for Speech Recognition
Abstract: Conventional acoustic models for automatic speech recognition (ASR) are usually built from sub-word units (e.g., context-dependent phonemes, graphemes, or wordpieces). Recent studies demonstrate that connectionist temporal classification (CTC) based acoustic-to-word (A2W) models are also promising for ASR. Such structures have drawn increasing attention because they directly target words as output units, which simplifies the ASR pipeline by avoiding an additional pronunciation lexicon, or even a language model. In this study, we systematically explore using words as the acoustic modeling unit for conversational speech recognition. By replacing the senone alignment with a word alignment in a convolutional bidirectional LSTM architecture and employing lexicon-free weighted finite-state transducer (WFST) based decoding, we greatly simplify the conventional hybrid speech recognition system. On the Hub5-2000 Switchboard/CallHome test sets with 300 hours of training data, we achieve a WER close to that of senone-based hybrid systems with WFST-based decoding.
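To make the modeling change concrete, below is a minimal sketch, assuming PyTorch, of a convolutional bidirectional LSTM acoustic model whose output layer ranges over a word vocabulary instead of senones and is trained with frame-level cross-entropy against a word alignment. The feature dimension, vocabulary size, layer widths, and the class name ConvBLSTMWordAM are illustrative assumptions, not the paper's exact configuration.

```python
# Illustrative sketch (not the authors' exact topology): a CNN + bidirectional
# LSTM acoustic model that emits per-frame posteriors over whole words, trained
# with frame-level cross-entropy against a word alignment instead of a senone one.
import torch
import torch.nn as nn


class ConvBLSTMWordAM(nn.Module):
    def __init__(self, feat_dim=40, num_words=10000, hidden=512, layers=3):
        super().__init__()
        # Small 2-D convolutional front end over (time, frequency).
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.blstm = nn.LSTM(
            input_size=32 * feat_dim, hidden_size=hidden,
            num_layers=layers, batch_first=True, bidirectional=True,
        )
        # Output layer over the word vocabulary (word units replace senones).
        self.out = nn.Linear(2 * hidden, num_words)

    def forward(self, feats):           # feats: (batch, time, feat_dim)
        x = feats.unsqueeze(1)          # -> (batch, 1, time, feat_dim)
        x = self.conv(x)                # -> (batch, 32, time, feat_dim)
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)
        x, _ = self.blstm(x)
        return self.out(x)              # per-frame word logits


# Frame-level cross-entropy against a word alignment (toy batch for illustration).
model = ConvBLSTMWordAM()
feats = torch.randn(4, 200, 40)                    # 4 utterances, 200 frames
word_alignment = torch.randint(0, 10000, (4, 200))  # word label per frame
logits = model(feats)
loss = nn.functional.cross_entropy(
    logits.reshape(-1, logits.size(-1)), word_alignment.reshape(-1))
loss.backward()
```

Because the outputs are already words, decoding such a model with a lexicon-free WFST amounts to composing the per-frame word posteriors with a word-level grammar; no pronunciation lexicon transducer is required, which is the simplification the abstract describes.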