Automatic speech recognition (ASR) for the diagnosis of pronunciation of speech sound disorders in Korean children.

Impact factor: 1.0 | Zone 4 (Medicine) | JCR Q4, Audiology & Speech-Language Pathology

Clinical Linguistics & Phonetics | Pub Date: 2025-10-01 | Epub Date: 2024-08-20 | Pages: 913-926 | DOI: 10.1080/02699206.2024.2387609

Taekyung Ahn, Yeonjung Hong, Younggon Im, Do Hyung Kim, Dayoung Kang, Joo Won Jeong, Jae Won Kim, Min Jung Kim, Ah-Ra Cho, Hosung Nam, Dae-Hyun Jang

Citations: 0

Abstract

Automatic speech recognition (ASR) for the diagnosis of pronunciation of speech sound disorders in Korean children.

This study presents an automatic speech recognition (ASR) model designed to diagnose pronunciation problems in children with speech sound disorders (SSDs), with the aim of replacing manual transcription in clinical procedures. General-purpose ASR models mainly map input speech to words in their standard spelling, so well-known high-performance models are not suited to evaluating the pronunciation of children with SSDs. We fine-tuned the wav2vec 2.0 XLS-R model to recognise words as children actually pronounce them, rather than converting the speech into standard spelling. The model was fine-tuned on a speech dataset of 137 children with SSDs pronouncing 73 Korean words selected for actual clinical diagnosis. Its Phoneme Error Rate (PER) was only 10% when its predictions of the children's pronunciations were compared with human annotations of the pronunciations as heard. In contrast, despite its robust performance on general tasks, the state-of-the-art ASR model Whisper showed limitations in recognising the speech of children with SSDs, with a PER of approximately 50%. Although the fine-tuned model still requires improvement in recognising unclear pronunciation, this study demonstrates that ASR models can streamline complex diagnostic procedures for pronunciation errors in clinical settings.
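The abstract reports results as a Phoneme Error Rate but does not say how it was computed. The sketch below uses the conventional definition, which may differ in detail from the authors' procedure: the edit distance between the human annotation (reference) and the model output (hypothesis), divided by the reference length. Treating Unicode-decomposed Hangul jamo as phoneme units is an assumption made here for illustration, and the example words are hypothetical rather than taken from the study's 73-word list.

```python
import unicodedata
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance: fewest substitutions, deletions and insertions."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning i reference units into an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phoneme_error_rate(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """PER = edit distance / number of reference phonemes."""
    if not ref:
        raise ValueError("reference must contain at least one phoneme")
    return edit_distance(ref, hyp) / len(ref)


def to_jamo(word: str) -> List[str]:
    """Split Hangul syllables into jamo via NFD (assumed phoneme proxy)."""
    return list(unicodedata.normalize("NFD", word))


# Hypothetical case: the annotator hears the child say "다과" for the
# target word "사과". A recogniser that normalises to standard spelling
# outputs "사과" and hides the error the clinician needs to see.
heard = to_jamo("다과")
normalised_output = to_jamo("사과")
as_pronounced_output = to_jamo("다과")

print(phoneme_error_rate(heard, normalised_output))     # one of four jamo differs
print(phoneme_error_rate(heard, as_pronounced_output))  # identical sequences
```

Under this definition a recogniser that always returns the dictionary form is penalised exactly where the child's pronunciation deviates, which is the behaviour the abstract attributes to general-purpose models.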

Source journal

Clinical Linguistics & Phonetics (Audiology & Speech-Language Pathology; Rehabilitation)

CiteScore

2.70

Self-citation rate

16.70%

Publication volume

Review time

6-12 weeks

About the journal: Clinical Linguistics & Phonetics encompasses the following: linguistics and phonetics of disorders of speech and language; contribution of data from communication disorders to theories of speech production and perception; research on communication disorders in multilingual populations, in under-researched populations, and in languages other than English; pragmatic aspects of speech and language disorders; clinical dialectology and sociolinguistics; childhood, adolescent and adult disorders of communication; linguistics and phonetics of hearing impairment, sign language and lip-reading.