Interspeech Pub Date : 2024-09-01 DOI: 10.21437/interspeech.2024-2236

Tomás Arias-Vergara, Paula Andrea Pérez-Toro, Xiaofeng Liu, Fangxu Xing, Maureen Stone, Jiachen Zhuo, Jerry L Prince, Maria Schuster, Elmar Nöth, Jonghye Woo, Andreas Maier

Title: Contrastive Learning Approach for Assessment of Phonological Precision in Patients with Tongue Cancer Using MRI Data.

Abstract: Magnetic Resonance Imaging (MRI) allows analyzing speech production by capturing high-resolution images of the dynamic processes in the vocal tract. In clinical applications, combining MRI with synchronized speech recordings leads to improved patient outcomes, especially if a phonological-based approach is used for assessment. However, when audio signals are unavailable, the recognition accuracy of sounds decreases when only MRI data are used. We propose a contrastive learning approach to improve the detection of phonological classes from MRI data when acoustic signals are not available at inference time. We demonstrate that frame-wise recognition of phonological classes improves from an F1 score of 0.74 to 0.85 when the contrastive loss approach is implemented. Furthermore, we show the utility of our approach in the clinical application of using such phonological classes to assess speech disorders in patients with tongue cancer, yielding promising results in the recognition task.

Interspeech, vol. 2024, pp. 927-931. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11671147/pdf/
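The abstract above does not state which contrastive objective the authors use. As a minimal sketch only, the following shows one common choice, an InfoNCE-style loss that pulls each MRI frame embedding toward the audio embedding of the same frame and away from the others. The function names, the temperature value, and the toy embeddings are all assumptions made for illustration, not details from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def info_nce(mri_emb, audio_emb, temperature=0.1):
    """Mean InfoNCE loss; frame i of mri_emb is paired with frame i of audio_emb.

    Hypothetical helper: the temperature and pairing scheme are assumptions.
    """
    losses = []
    for i, m in enumerate(mri_emb):
        logits = [cosine(m, a) / temperature for a in audio_emb]
        # Numerically stable log-sum-exp over all candidate audio frames.
        max_logit = max(logits)
        log_sum = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        # Negative log-probability of picking the matching audio frame.
        losses.append(log_sum - logits[i])
    return sum(losses) / len(losses)


# Toy embeddings, invented for illustration (not data from the paper).
mri = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(mri, [[0.9, 0.1], [0.1, 0.9]])
swapped = info_nce(mri, [[0.1, 0.9], [0.9, 0.1]])
```

With these toy inputs the loss is lower when the audio embeddings line up with their MRI frames than when they are swapped. Because the audio branch is needed only to compute the loss, a model trained this way could run on MRI alone at inference time, which matches the setting the abstract describes.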

Citations: 0

Analyzing Multimodal Features of Spontaneous Voice Assistant Commands for Mild Cognitive Impairment Detection.

Interspeech Pub Date : 2024-09-01 DOI: 10.21437/interspeech.2024-2288

Nana Lin, Youxiang Zhu, Xiaohui Liang, John A Batsis, Caroline Summerour

Citations: 0

Segmental and Suprasegmental Speech Foundation Models for Classifying Cognitive Risk Factors: Evaluating Out-of-the-Box Performance.

Interspeech Pub Date : 2024-09-01 DOI: 10.21437/interspeech.2024-2063

Si-Ioi Ng, Lingfeng Xu, Kimberly D Mueller, Julie Liss, Visar Berisha

Abstract: Speech foundation models are remarkably successful in various consumer applications, prompting their extension to clinical use cases. This is challenged by small clinical datasets, which preclude effective fine-tuning. We tested the efficacy of two models to classify participants by segmental (Wav2Vec2.0) and suprasegmental (Trillsson) speech analysis windows. Analysis at both time scales has shown differences in the context of cognitive decline. Speakers were classified as healthy controls (HC), Amyloid-β+ (Aβ+), mild cognitive impairment (MCI), or dementia. A subset of W2V2 and Trillsson representations showed large effect sizes between HC and each risk factor. Cross-validation showed that W2V2 consistently outperforms Trillsson. Mean macro-F1 scores of 54.1%, 63.5%, and 72.0% were found for classifying Aβ+, MCI, and dementia from HC. Repeatability of Trillsson and W2V2 showed intraclass correlations of 0.30 and 0.41. Reliability of such models must be enhanced for clinical speech analysis and longitudinal tracking.

Interspeech, vol. 2024, pp. 917-921. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11884505/pdf/
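The results above are reported as macro-F1. As a reminder of what that metric measures, here is a minimal sketch of the standard definition: the unweighted mean of the per-class F1 scores. The helper name and the toy labels are assumptions made for illustration and are not data from the paper.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (hypothetical helper)."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels, invented for illustration only.
y_true = ["HC", "HC", "MCI", "MCI"]
y_pred = ["HC", "MCI", "MCI", "MCI"]
score = macro_f1(y_true, y_pred)
```

Because every class contributes equally regardless of its size, macro-F1 is a common choice when the clinical groups are small or imbalanced.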

Citations: 0

YOLO-Stutter: End-to-end Region-Wise Speech Dysfluency Detection.

Interspeech Pub Date : 2024-09-01 DOI: 10.21437/interspeech.2024-1855

Xuanru Zhou, Anshul Kashyap, Steve Li, Ayati Sharma, Brittany Morin, David Baquirin, Jet Vonk, Zoe Ezzes, Zachary Miller, Maria Tempini, Jiachen Lian, Gopala Anumanchipalli

Abstract: Dysfluent speech detection is the bottleneck for disordered speech analysis and spoken language learning. Current state-of-the-art models are governed by rule-based systems [1, 2] which lack efficiency and robustness, and are sensitive to template design. In this paper, we propose YOLO-Stutter: a first end-to-end method that detects dysfluencies in a time-accurate manner. YOLO-Stutter takes imperfect speech-text alignment as input, followed by a spatial feature aggregator and a temporal dependency extractor, to perform region-wise boundary and class predictions. We also introduce two dysfluency corpora, VCTK-Stutter and VCTK-TTS, that simulate natural spoken dysfluencies including repetition, block, missing, replacement, and prolongation. Our end-to-end method achieves state-of-the-art performance with a minimum number of trainable parameters on both simulated data and real aphasia speech. Code and datasets are open-sourced at https://github.com/rorizzz/YOLO-Stutter.

Interspeech, vol. 2024, pp. 937-941. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12226351/pdf/

Citations: 0

How Does Alignment Error Affect Automated Pronunciation Scoring in Children's Speech?

Interspeech Pub Date : 2024-09-01 DOI: 10.21437/interspeech.2024-2239

Prad Kadambi, Tristan Mahr, Lucas Annear, Henry Nomeland, Julie Liss, Katherine Hustad, Visar Berisha

Abstract: Automated goodness of pronunciation scores measure deviation from typical adult speech by first phonetically segmenting speech using forced alignment and then computing phoneme likelihoods. Care must be taken to distinguish between the impact of alignment error (a spurious signal) and true acoustic deviation on the automated score. Using mixed effects modeling, we predict $\Delta\mathrm{PLLR}$, the difference between pronunciation scores computed using manual alignment ($\mathrm{PLLR}_m$) versus computed using automatic forced alignments ($\mathrm{PLLR}_a$). Pronunciation deviations and alignment error are both magnified in children's speech and may be influenced by factors such as phoneme position and phoneme type. Our methodology shows that alignment error has a moderate effect on $\Delta\mathrm{PLLR}$, and other variables have small to no effect. Manual $\mathrm{PLLR}$ closely matches automatically calculated $\mathrm{PLLR}$ following cross utterance averaging. Thus, practical comparisons between child speakers should be very comparable across the two methods.

Interspeech, vol. 2024, pp. 5133-5137. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11977302/pdf/
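The quantity modeled in the abstract above can be written compactly. The abstract describes it only as the difference between the manually aligned and the automatically aligned score, so the sign convention below (manual minus automatic) is an assumption:

```latex
\Delta\mathrm{PLLR} = \mathrm{PLLR}_m - \mathrm{PLLR}_a
```

Under this convention, a value near zero means the forced aligner and the human annotator lead to nearly the same pronunciation score for that phoneme.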

Citations: 0

Comparing ambulatory voice measures during daily life with brief laboratory assessments in speakers with and without vocal hyperfunction.

Interspeech Pub Date : 2024-09-01 DOI: 10.21437/interspeech.2024-1484

Daryush D Mehta, Jarrad H Van Stan, Hamzeh Ghasemzadeh, Robert E Hillman

Citations: 0

Remote Assessment for ALS using Multimodal Dialog Agents: Data Quality, Feasibility and Task Compliance.

Interspeech Pub Date : 2023-08-01 DOI: 10.21437/interspeech.2023-2115

Vanessa Richter, Michael Neumann, Jordan R Green, Brian Richburg, Oliver Roesler, Hardik Kothare, Vikram Ramanarayanan

Abstract: We investigate the feasibility, task compliance and audiovisual data quality of a multimodal dialog-based solution for remote assessment of Amyotrophic Lateral Sclerosis (ALS). 53 people with ALS and 52 healthy controls interacted with Tina, a cloud-based conversational agent, in performing speech tasks designed to probe various aspects of motor speech function while their audio and video were recorded. We rated a total of 250 recordings for audio/video quality and participant task compliance, along with the relative frequency of different issues observed. We observed excellent compliance (98%) and audio (95.2%) and visual quality rates (84.8%), resulting in an overall yield of 80.8% recordings that were both compliant and of high quality. Furthermore, recording quality and compliance were not affected by level of speech severity and did not differ significantly across end devices. These findings support the utility of dialog systems for remote monitoring of speech in ALS.

Interspeech, vol. 2023, pp. 5441-5445. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10547018/pdf/nihms-1931217.pdf

Citations: 0

Pronunciation modeling of foreign words for Mandarin ASR by considering the effect of language transfer

Interspeech Pub Date : 2022-10-07 DOI: 10.21437/Interspeech.2014-353

Lei Wang, R. Tong

Citations: 3

Automatic Speaker Verification System for Dysarthria Patients

Interspeech Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-375

Shinimol Salim, S. Shahnawazuddin, Waquar Ahmad

Abstract: Dysarthria is one of the most common speech communication disorders associated with neurological damage that weakens the muscles necessary for speech. In this paper, we present our efforts towards developing an automatic speaker verification (ASV) system based on x-vectors for dysarthric speakers with varying speech intelligibility (low, medium and high). For that purpose, a baseline ASV system was trained on speech data from healthy speakers since there is severe scarcity of data from dysarthric speakers. To improve the performance with respect to dysarthric speakers, data augmentation based on duration modification is proposed in this study. Duration modification with several scaling factors was applied to healthy training speech. An ASV system was then trained on healthy speech augmented with its duration-modified versions. It compensates for the substantial disparities in phone duration between normal and dysarthric speakers of varying speech intelligibility. Experimental evaluations presented in this study show that the proposed duration-modification-based data augmentation resulted in a relative improvement of 22% over the baseline. Further to that, a relative improvement of 26% was obtained in the case of speakers with high severity level of dysarthria.

Interspeech, pp. 5070-5074.

Citations: 1

Robust Cough Feature Extraction and Classification Method for COVID-19 Cough Detection Based on Vocalization Characteristics

Interspeech Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-10401

Xueshuai Zhang, Jiakun Shen, J. Zhou, Pengyuan Zhang, Yonghong Yan, Zhihua Huang, Yanfen Tang, Yu Wang, Fujie Zhang, Shenmin Zhang, Aijun Sun

Abstract: A fast, efficient and accurate detection method for COVID-19 remains a critical challenge. Many cough-based COVID-19 detection studies have shown competitive results through artificial intelligence. However, the lack of analysis of the vocalization characteristics of cough sounds limits further improvement of detection performance. In this paper, we propose two novel acoustic features of cough sounds and a convolutional neural network structure for COVID-19 detection. First, a time-frequency differential feature is proposed to characterize dynamic information of cough sounds in the time and frequency domains. Then, an energy ratio feature is proposed to calculate the energy difference caused by the phonation characteristics in different cough phases. Finally, a convolutional neural network with two parallel branches, which is pre-trained on a large amount of unlabeled cough data, is proposed for classification. Experimental results show that our proposed method achieves state-of-the-art performance on the Coswara dataset for COVID-19 detection. The results on an external clinical dataset, Virufy, also show the better generalization ability of our proposed method. Copyright © 2022 ISCA.

Interspeech, pp. 2168-2172.

Citations: 2

Latest Interspeech papers