L. Dai, V. Kritskaia, Evelien van der Velden, Merel M. Jung, M. Postma, M. Louwerse
Evaluating the usage of Text-To-Speech in K12 education
Proceedings of the 2022 6th International Conference on Education and E-Learning, published 2022-11-21
DOI: 10.1145/3578837.3578864 (https://doi.org/10.1145/3578837.3578864)
Abstract
With increased interest in the use of virtual avatars for educational purposes, there is a growing need for high-quality text-to-speech solutions. However, the effects of using synthesized speech in educational applications for younger listeners are still unclear: past findings have been inconsistent, and most were obtained in a lab setting with adult assessors. It is also unclear how much training material is needed for high-quality speech synthesis. Particularly for low-resource languages, the assumption that good-quality synthesized speech requires substantial amounts of voice recordings for training may be hindering the development of TTS-based solutions. In this study, we created four Dutch text-to-speech (TTS) models from different amounts of training material and evaluated them in terms of voice perception and recall with K12 students in a classroom environment. Results showed that the original human voice outperformed the synthesized voices in both listening experience and knowledge test score. However, more hours of training material did not necessarily result in better outcomes, suggesting that 10-15 hours of speech material might be sufficient for training a Dutch TTS model. A weak positive correlation was found between listening experience and knowledge test performance, with low listening effort being the most important factor. This outcome suggests that comprehensibility is likely the most important TTS feature for educational applications.