Wenwei Dong, Catia Cucchiarini, Roeland van Hout, Helmer Strik
DOI: 10.1016/j.csl.2025.101858
Journal: Computer Speech and Language, vol. 95, Article 101858
Published: 2025-06-18 (Journal Article)
Full text: https://www.sciencedirect.com/science/article/pii/S088523082500083X
Predicting accentedness and comprehensibility through ASR scores and acoustic features
Accentedness and comprehensibility scales are widely used in measuring the oral proficiency of second language (L2) learners, including learners of English as a Second Language (ESL). In this paper, we focus on gaining a better understanding of the concepts of accentedness and comprehensibility by developing and applying automatic measures to ESL utterances produced by Indonesian learners. We extracted features on both the segmental and the suprasegmental (fundamental frequency, loudness, energy, etc.) levels to investigate which features are actually related to expert judgments on accentedness and comprehensibility. Automatic Speech Recognition (ASR) pronunciation scores based on the traditional Kaldi Time Delay Neural Network (TDNN) model and on the End-to-End Whisper model were applied, and data-driven methods were used, combining acoustic features extracted with the extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPS) and Praat. The experimental results showed that Whisper outperformed the Kaldi-TDNN model. The Whisper model gave the best results for predicting comprehensibility on the basis of phone distance, and the best results for predicting accentedness on the basis of grapheme distance. Combining segmental and suprasegmental features improved the results, yielding different feature rankings for comprehensibility and accentedness. In the final step of our analysis, we included differences between utterances and learners as random effects in a mixed linear regression model. Exploiting these information sources yielded a substantial improvement in predicting both comprehensibility and accentedness.
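The phone- and grapheme-distance scores referred to in the abstract can be illustrated with a minimal sketch. Assuming the distance is a Levenshtein (edit) distance between the reference sequence and the ASR hypothesis, normalized by reference length (the exact metric and normalization used in the paper may differ), a grapheme distance compares character sequences and a phone distance compares phone-symbol sequences; the phone transcriptions below are hypothetical examples, not data from the study:

```python
def levenshtein(ref, hyp):
    """Edit distance between two token sequences (graphemes or phones)."""
    # prev[j] holds the distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def normalized_distance(ref, hyp):
    """Edit distance per reference token, comparable across utterance lengths."""
    return levenshtein(ref, hyp) / max(len(ref), 1)

# Grapheme distance: reference spelling vs. recognized output (strings are
# iterated character by character).
print(normalized_distance("comprehensible", "comprehensable"))

# Phone distance: reference phone sequence vs. recognized phone sequence
# (hypothetical transcriptions of "cat").
print(normalized_distance(["k", "ae", "t"], ["k", "ah", "t"]))
```

Normalizing by reference length makes the score comparable across utterances of different lengths, which matters when such distances serve as predictors of utterance-level ratings.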
Journal description:
Computer Speech & Language publishes reports of original research related to the recognition, understanding, production, coding and mining of speech and language.
The speech and language sciences have a long history, but it is only relatively recently that large-scale implementation of, and experimentation with, complex models of speech and language processing have become feasible. Such research is often carried out somewhat separately by practitioners of artificial intelligence, computer science, electronic engineering, information retrieval, linguistics, phonetics, or psychology.