利用深度神经网络加强对双声道语音的分析

IF 3.4 3区计算机科学 Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computer Speech and Language Pub Date : 2024-09-02 DOI:10.1016/j.csl.2024.101715

Yael Segal-Feldman , Kasia Hitczenko , Matthew Goldrick , Adam Buchwald , Angela Roberts , Joseph Keshet

{"title":"利用深度神经网络加强对双声道语音的分析","authors":"Yael Segal-Feldman , Kasia Hitczenko , Matthew Goldrick , Adam Buchwald , Angela Roberts , Joseph Keshet","doi":"10.1016/j.csl.2024.101715","DOIUrl":null,"url":null,"abstract":"<div><p>Diadochokinetic speech tasks (DDK) involve the repetitive production of consonant-vowel syllables. These tasks are useful in detecting impairments, differential diagnosis, and monitoring progress in speech-motor impairments. However, manual analysis of those tasks is time-consuming, subjective, and provides only a rough picture of speech. This paper presents several deep neural network models working on the raw waveform for the automatic segmentation of stop consonants and vowels from unannotated and untranscribed speech. A deep encoder serves as a features extractor module, replacing conventional signal processing features. In this context, diverse deep learning architectures, such as convolutional neural networks (CNNs) and large self-supervised models like HuBERT, are applied for the extraction process. A decoder model uses derived embeddings to identify frame types. Consequently, the paper studies diverse deep architectures, ranging from linear layers, LSTM, CNN, and transformers. These architectures are assessed for their ability to detect speech rate, sound duration, and boundary locations on a dataset of healthy individuals and an unseen dataset of older individuals with Parkinson’s Disease. The results reveal that an LSTM model performs better than all other models on both datasets and is comparable to trained human annotators.</p></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"90 ","pages":"Article 101715"},"PeriodicalIF":3.4000,"publicationDate":"2024-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Enhancing analysis of diadochokinetic speech using deep neural networks\",\"authors\":\"Yael Segal-Feldman , Kasia Hitczenko , Matthew Goldrick , Adam Buchwald , Angela Roberts , Joseph Keshet\",\"doi\":\"10.1016/j.csl.2024.101715\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><p>Diadochokinetic speech tasks (DDK) involve the repetitive production of consonant-vowel syllables. These tasks are useful in detecting impairments, differential diagnosis, and monitoring progress in speech-motor impairments. However, manual analysis of those tasks is time-consuming, subjective, and provides only a rough picture of speech. This paper presents several deep neural network models working on the raw waveform for the automatic segmentation of stop consonants and vowels from unannotated and untranscribed speech. A deep encoder serves as a features extractor module, replacing conventional signal processing features. In this context, diverse deep learning architectures, such as convolutional neural networks (CNNs) and large self-supervised models like HuBERT, are applied for the extraction process. A decoder model uses derived embeddings to identify frame types. Consequently, the paper studies diverse deep architectures, ranging from linear layers, LSTM, CNN, and transformers. These architectures are assessed for their ability to detect speech rate, sound duration, and boundary locations on a dataset of healthy individuals and an unseen dataset of older individuals with Parkinson’s Disease. The results reveal that an LSTM model performs better than all other models on both datasets and is comparable to trained human annotators.</p></div>\",\"PeriodicalId\":50638,\"journal\":{\"name\":\"Computer Speech and Language\",\"volume\":\"90 \",\"pages\":\"Article 101715\"},\"PeriodicalIF\":3.4000,\"publicationDate\":\"2024-09-02\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Computer Speech and Language\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0885230824000986\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer Speech and Language","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0885230824000986","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 0

摘要

声动力言语任务（DDK）涉及辅音-元音音节的重复发音。这些任务有助于检测言语运动障碍、鉴别诊断和监测进展。然而，对这些任务进行人工分析既费时又主观，而且只能提供一个粗略的语音图像。本文介绍了几种深度神经网络模型，这些模型可处理原始波形，用于自动分割未注释和未转录语音中的停止辅音和元音。深度编码器可作为特征提取模块，取代传统的信号处理特征。在这种情况下，不同的深度学习架构，如卷积神经网络（CNN）和大型自监督模型（如 HuBERT），被应用于提取过程。解码器模型使用衍生嵌入来识别帧类型。因此，本文研究了各种深度架构，包括线性层、LSTM、CNN 和变换器。本文评估了这些架构在健康人数据集和帕金森病老年患者未见数据集上检测语音速率、声音持续时间和边界位置的能力。结果表明，在这两个数据集上，LSTM 模型的表现优于所有其他模型，并可与训练有素的人类标注者相媲美。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Enhancing analysis of diadochokinetic speech using deep neural networks

Diadochokinetic speech tasks (DDK) involve the repetitive production of consonant-vowel syllables. These tasks are useful in detecting impairments, differential diagnosis, and monitoring progress in speech-motor impairments. However, manual analysis of those tasks is time-consuming, subjective, and provides only a rough picture of speech. This paper presents several deep neural network models working on the raw waveform for the automatic segmentation of stop consonants and vowels from unannotated and untranscribed speech. A deep encoder serves as a features extractor module, replacing conventional signal processing features. In this context, diverse deep learning architectures, such as convolutional neural networks (CNNs) and large self-supervised models like HuBERT, are applied for the extraction process. A decoder model uses derived embeddings to identify frame types. Consequently, the paper studies diverse deep architectures, ranging from linear layers, LSTM, CNN, and transformers. These architectures are assessed for their ability to detect speech rate, sound duration, and boundary locations on a dataset of healthy individuals and an unseen dataset of older individuals with Parkinson’s Disease. The results reveal that an LSTM model performs better than all other models on both datasets and is comparable to trained human annotators.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Computer Speech and Language 工程技术-计算机：人工智能

CiteScore

11.30

自引率

4.70%

发文量

审稿时长

22.9 weeks

期刊介绍： Computer Speech & Language publishes reports of original research related to the recognition, understanding, production, coding and mining of speech and language. The speech and language sciences have a long history, but it is only relatively recently that large-scale implementation of and experimentation with complex models of speech and language processing has become feasible. Such research is often carried out somewhat separately by practitioners of artificial intelligence, computer science, electronic engineering, information retrieval, linguistics, phonetics, or psychology.