{"title":"How prosody affects ASR performance in conversational Austrian German","authors":"Saskia Wepner, Barbara Schuppler, G. Kubin","doi":"10.21437/speechprosody.2022-40","journal":{"name":"Speech Prosody 2022","volume":"67 1","pages":"0"},"publicationDate":"2022-05-23","publicationTypes":"Journal Article","isOpenAccess":false,"citationCount":"2","platform":"Semanticscholar","PeriodicalName":"Speech Prosody 2022","ListUrlMain":"https://doi.org/10.21437/speechprosody.2022-40"}
How prosody affects ASR performance in conversational Austrian German
Currently available Automatic Speech Recognition (ASR) systems achieve good word error rates (WER) for read speech (2–10%), but not for conversational speech (20–40%), a speaking style especially relevant for dialogue systems, as they become more conversational and interactional. Here, we analyse how prosody affects WER in a Kaldi-based speech recognition system for a corpus of conversational Austrian German. This analysis is a step towards improving ASR systems and increasing our knowledge about which aspects are relevant to consider for ASR of conversational speech. For this purpose, we compare a typical language model (LM) with an oracle LM trained on the utterances from the whole corpus, thus knowing each possible N-gram in advance. We find that short, deaccented words have the lowest recognition accuracy, which also cannot be compensated for by the oracle LM. Despite our overall high WERs, the highly prominent words were recognised significantly better. Our findings suggest that reporting global WERs for an ASR system of conversational speech does not predict its usefulness in dialogue systems. Given the role of prominent words in carrying meaning and function in conversation, our analysis is relevant for researchers developing automatic speech understanding systems.
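
The contrast between a global WER and recognition accuracy per prominence level can be illustrated with a small sketch. This is not the authors' Kaldi scoring pipeline; it is a plain Levenshtein alignment written for illustration. The function names (`align_hits`, `accuracy_by_label`), the example utterance, and the prominence labels `"low"` and `"high"` are all hypothetical and do not come from the paper or its corpus.

```python
from collections import defaultdict


def align_hits(ref, hyp):
    """Align reference and hypothesis word lists by edit distance.

    Returns (edits, hits): the minimum number of substitutions, deletions
    and insertions, and one boolean per reference word that is True when
    the word is matched correctly in one optimal alignment.
    """
    n, m = len(ref), len(hyp)
    # d[i][j] = minimum edits needed to turn ref[:i] into hyp[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + cost,  # match or substitution
                          d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1)         # insertion
    # Backtrace to find which reference words were recognised correctly.
    hits = [False] * n
    i, j = n, m
    while i > 0 and j > 0:
        cost = 0 if ref[i - 1] == hyp[j - 1] else 1
        if d[i][j] == d[i - 1][j - 1] + cost:
            hits[i - 1] = (cost == 0)
            i, j = i - 1, j - 1
        elif d[i][j] == d[i - 1][j] + 1:
            i -= 1  # reference word was deleted
        else:
            j -= 1  # hypothesis word was inserted
    return d[n][m], hits


def accuracy_by_label(hits, labels):
    """Fraction of correctly recognised reference words per label."""
    correct, total = defaultdict(int), defaultdict(int)
    for hit, label in zip(hits, labels):
        total[label] += 1
        correct[label] += int(hit)
    return {label: correct[label] / total[label] for label in total}


# Invented toy utterance: the short, deaccented particle "ja" is dropped.
ref = "das ist ja wirklich schön".split()
hyp = "das ist wirklich schön".split()
labels = ["low", "low", "low", "high", "high"]  # hypothetical prominence labels

edits, hits = align_hits(ref, hyp)
global_wer = edits / len(ref)
acc = accuracy_by_label(hits, labels)
print(f"global WER: {global_wer:.2f}")
print(f"accuracy by prominence: {acc}")
```

In this toy case a single global figure hides the fact that every error falls on a low-prominence word, which is the kind of breakdown the abstract argues is more informative than the global WER alone.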