Wenwei Dong, Catia Cucchiarini, Roeland van Hout, Helmer Strik
DOI: 10.1016/j.csl.2025.101858
Journal: Computer Speech and Language, vol. 95, Article 101858
Published: 2025-06-18 (Journal Article)
Full text: https://www.sciencedirect.com/science/article/pii/S088523082500083X
Predicting accentedness and comprehensibility through ASR scores and acoustic features
Accentedness and comprehensibility scales are widely used in measuring the oral proficiency of second language (L2) learners, including learners of English as a Second Language (ESL). In this paper, we focus on gaining a better understanding of the concepts of accentedness and comprehensibility by developing and applying automatic measures to ESL utterances produced by Indonesian learners. We extracted features on both the segmental and the suprasegmental (fundamental frequency, loudness, energy, etc.) levels to investigate which features are actually related to expert judgments on accentedness and comprehensibility. Automatic Speech Recognition (ASR) pronunciation scores based on the traditional Kaldi Time Delay Neural Network (TDNN) model and on the End-to-End Whisper model were applied, and data-driven methods were used, combining acoustic features extracted with the extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPS) and Praat. The experimental results showed that Whisper outperformed the Kaldi-TDNN model. The Whisper model gave the best results for predicting comprehensibility on the basis of phone distance, and the best results for predicting accentedness on the basis of grapheme distance. Combining segmental and suprasegmental features improved the results, yielding different feature rankings for comprehensibility and accentedness. In the final step of our analysis, we included differences between utterances and learners as random effects in a mixed linear regression model. Exploiting these information sources yielded a substantial improvement in predicting both comprehensibility and accentedness.
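The phone- and grapheme-distance scores referred to in the abstract can be illustrated with a minimal sketch. Assuming the distance is a Levenshtein (edit) distance between the reference sequence and the ASR hypothesis, normalized by reference length (the exact metric and normalization used in the paper may differ), a grapheme distance compares character sequences and a phone distance compares phone-symbol sequences; the phone transcriptions below are hypothetical examples, not data from the study:

```python
def levenshtein(ref, hyp):
    """Edit distance between two token sequences (graphemes or phones)."""
    # prev[j] holds the distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def normalized_distance(ref, hyp):
    """Edit distance per reference token, comparable across utterance lengths."""
    return levenshtein(ref, hyp) / max(len(ref), 1)

# Grapheme distance: reference spelling vs. recognized output (strings are
# iterated character by character).
print(normalized_distance("comprehensible", "comprehensable"))

# Phone distance: reference phone sequence vs. recognized phone sequence
# (hypothetical transcriptions of "cat").
print(normalized_distance(["k", "ae", "t"], ["k", "ah", "t"]))
```

Normalizing by reference length makes the score comparable across utterances of different lengths, which matters when such distances serve as predictors of utterance-level ratings.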
Journal description:
Computer Speech & Language publishes reports of original research related to the recognition, understanding, production, coding and mining of speech and language.
The speech and language sciences have a long history, but it is only relatively recently that large-scale implementation of, and experimentation with, complex models of speech and language processing have become feasible. Such research is often carried out somewhat separately by practitioners of artificial intelligence, computer science, electronic engineering, information retrieval, linguistics, phonetics, or psychology.