Direct articulatory observation reveals phoneme recognition performance characteristics of a self-supervised speech model.

IF 1.4 Q3 ACOUSTICS

JASA Express Letters. Pub Date: 2024-11-01. DOI: 10.1121/10.0034430

Xuan Shi, Tiantian Feng, Kevin Huang, Sudarsana Reddy Kadiri, Jihwan Lee, Yijing Lu, Yubin Zhang, Louis Goldstein, Shrikanth Narayanan

Citations: 0

Abstract

Variability in speech pronunciation is widely observed across different linguistic backgrounds, which impacts modern automatic speech recognition performance. Here, we evaluate the performance of a self-supervised speech model in phoneme recognition using direct articulatory evidence. Findings indicate significant differences in phoneme recognition, especially in front vowels, between American English and Indian English speakers. To gain a deeper understanding of these differences, we conduct real-time MRI-based articulatory analysis, revealing distinct velar region patterns during the production of specific front vowels. This underscores the need to deepen the scientific understanding of self-supervised speech model variances to advance robust and inclusive speech technology.




Source journal

JASA express letters

CiteScore

1.70

Self-citation rate

0.00%

Publication count