Phonological level wav2vec2-based Mispronunciation Detection and Diagnosis method

IF 3; Region 3, Computer Science; JCR Q2, ACOUSTICS

Speech Communication Pub Date : 2025-05-23 DOI:10.1016/j.specom.2025.103249

Mostafa Shahin, Julien Epps, Beena Ahmed

Speech Communication, volume 173, Article 103249. https://www.sciencedirect.com/science/article/pii/S0167639325000640

Citations: 0

Abstract

The automatic identification and analysis of pronunciation errors, known as Mispronunciation Detection and Diagnosis (MDD), plays a crucial role in Computer-Aided Pronunciation Learning (CAPL) tools such as Second-Language (L2) learning or speech therapy applications. Existing MDD methods that rely on analysing phonemes can only detect categorical errors of phonemes that have enough training data to be modelled. Because pronunciation errors made by non-native or disordered speakers are unpredictable and training datasets are scarce, it is infeasible to model all types of mispronunciation. Moreover, phoneme-level MDD approaches provide only limited diagnostic information about the error made. To address this, we propose a low-level MDD approach based on the detection of phonological features. Phonological features break down phoneme production into elementary components that are directly related to the articulatory system, leading to more formative feedback for the learner. We further propose a multi-label variant of the Connectionist Temporal Classification (CTC) approach to jointly model the non-mutually exclusive phonological features using a single model. The pre-trained wav2vec2 model was employed as the core model of the phonological feature detector. The proposed method was applied to L2 speech corpora collected from English learners with different native languages. Compared with traditional phoneme-level MDD, the proposed phonological-level method achieved a significantly lower False Acceptance Rate (FAR), False Rejection Rate (FRR), and Diagnostic Error Rate (DER) over all phonological features.
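To make the idea of feature-level diagnosis concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a tiny, hypothetical feature table (`PHONEME_FEATURES`), hypothetical helper names (`feature_targets`, `diagnose`), and detected features that are already aligned one-to-one with the canonical phonemes. The abstract specifies none of these. The sketch shows how each phoneme can be expanded into one binary label sequence per feature, which is the kind of non-mutually exclusive target a multi-label model would predict, and how comparing expected and detected features yields feedback more specific than "wrong phoneme".

```python
# Hypothetical, heavily simplified feature inventory (illustration only;
# NOT the feature set used in the paper).
FEATURES = ("voiced", "nasal", "labial", "fricative")

# Hypothetical phoneme-to-feature table for a handful of consonants.
PHONEME_FEATURES = {
    "p": {"labial"},
    "b": {"voiced", "labial"},
    "m": {"voiced", "nasal", "labial"},
    "f": {"labial", "fricative"},
    "v": {"voiced", "labial", "fricative"},
    "s": {"fricative"},
    "z": {"voiced", "fricative"},
}


def feature_targets(phonemes):
    """Expand a phoneme sequence into one binary label sequence per feature.

    Several features can be active for the same phoneme, so the label
    sequences are not mutually exclusive.
    """
    return {
        feat: [int(feat in PHONEME_FEATURES[p]) for p in phonemes]
        for feat in FEATURES
    }


def diagnose(canonical, detected):
    """Compare expected features with detected ones, position by position.

    `canonical` is the expected phoneme sequence; `detected` is a list of
    feature sets, assumed (for this sketch) to be aligned one-to-one with it.
    Returns (position, phoneme, missing_features, extra_features) tuples
    for every position where the feature sets differ.
    """
    if len(canonical) != len(detected):
        raise ValueError("sketch assumes pre-aligned sequences of equal length")
    report = []
    for i, (phoneme, actual) in enumerate(zip(canonical, detected)):
        expected = PHONEME_FEATURES[phoneme]
        missing = sorted(expected - set(actual))
        extra = sorted(set(actual) - expected)
        if missing or extra:
            report.append((i, phoneme, missing, extra))
    return report


if __name__ == "__main__":
    # A learner devoices /v/: the detector finds "labial" and "fricative"
    # but not "voiced" at position 0.
    print(feature_targets(["v", "z"]))
    print(diagnose(["v", "z"], [{"labial", "fricative"}, {"voiced", "fricative"}]))
```

In this toy example the report points at the missing `voiced` feature on /v/, which suggests the kind of articulatory feedback a feature-level system could give. A real system would obtain the detected features from a trained model and would also need to handle insertions and deletions through alignment.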


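Source journal

Speech Communication (Engineering and Technology; Computer Science: Interdisciplinary Applications)

CiteScore

6.80

Self-citation rate

6.20%

Number of publications

Review time

19.2 weeks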

About the journal: Speech Communication is an interdisciplinary journal whose primary objective is to fulfil the need for the rapid dissemination and thorough discussion of basic and applied research results. The journal's primary objectives are: • to present a forum for the advancement of human and human-machine speech communication science; • to stimulate cross-fertilization between different fields of this domain; • to contribute towards the rapid and wide diffusion of scientifically sound contributions in this domain.