Kenichi Arai, A. Ogawa, S. Araki, K. Kinoshita, T. Nakatani, Naoyuki Kamo, T. Irino
2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), published 2022-11-07. DOI: 10.23919/APSIPAASC55919.2022.9980257
Intelligibility prediction of enhanced speech using recognition accuracy of end-to-end ASR systems
We propose speech intelligibility (SI) prediction methods that use the recognition accuracy of an end-to-end (E2E) automatic speech recognition (ASR) system. Owing to recent significant progress, the recognition performance of such systems has become comparable to that of humans. These predictors will help drive the development of speech enhancement methods for human listeners. In this paper, we evaluate how accurately the proposed methods predict the intelligibility of enhanced noisy speech signals. Our experiments show that, when the ASR systems are trained on diverse noisy speech data, the proposed methods, which do not require clean reference signals, predict SI more accurately than existing “intrusive” methods: short-time objective intelligibility (STOI), extended STOI (eSTOI), and our previously proposed methods based on hybrid deep neural network-hidden Markov model (DNN-HMM) ASR systems. Our experiments also show that a variant of our method that additionally uses clean speech to determine the speech region of the evaluation signals improves prediction accuracy further, outperforming the existing methods.