YOLO-Stutter: End-to-end Region-Wise Speech Dysfluency Detection
Xuanru Zhou, Anshul Kashyap, Steve Li, Ayati Sharma, Brittany Morin, David Baquirin, Jet Vonk, Zoe Ezzes, Zachary Miller, Maria Tempini, Jiachen Lian, Gopala Anumanchipalli
Interspeech 2024, pages 937-941
DOI: 10.21437/interspeech.2024-1855
Published: 2024-09-01
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12226351/pdf/
Citations: 0
Abstract
Dysfluent speech detection is the bottleneck for disordered speech analysis and spoken language learning. Current state-of-the-art models are governed by rule-based systems [1, 2], which lack efficiency and robustness and are sensitive to template design. In this paper, we propose YOLO-Stutter: the first end-to-end method that detects dysfluencies in a time-accurate manner. YOLO-Stutter takes imperfect speech-text alignment as input, followed by a spatial feature aggregator and a temporal dependency extractor, to perform region-wise boundary and class predictions. We also introduce two dysfluency corpora, VCTK-Stutter and VCTK-TTS, that simulate natural spoken dysfluencies including repetition, block, missing, replacement, and prolongation. Our end-to-end method achieves state-of-the-art performance with a minimal number of trainable parameters on both simulated data and real aphasia speech. Code and datasets are open-sourced at https://github.com/rorizzz/YOLO-Stutter.
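To make the "region-wise boundary and class predictions" idea concrete, the sketch below shows a YOLO-style prediction head over a frame-level feature sequence: the time axis is split into fixed regions, each region is pooled into one vector, and per-region heads regress a normalized (start, width) boundary and score the dysfluency classes. This is a minimal illustration with random placeholder weights and assumed shapes, not the authors' actual architecture (which uses a learned spatial aggregator and temporal dependency extractor).

```python
import numpy as np

rng = np.random.default_rng(0)

def region_wise_head(features, num_regions=8, num_classes=5):
    """YOLO-style region-wise prediction sketch.

    features: (T, D) frame-level features from an imperfect
    speech-text alignment (shapes are illustrative assumptions).
    Splits T frames into num_regions regions, mean-pools each,
    then predicts a normalized (start, width) boundary and class
    scores per region. Weights are random placeholders.
    """
    T, D = features.shape
    W_box = rng.standard_normal((D, 2)) * 0.1          # boundary regressor
    W_cls = rng.standard_normal((D, num_classes)) * 0.1  # class scorer

    regions = np.array_split(features, num_regions, axis=0)
    pooled = np.stack([r.mean(axis=0) for r in regions])   # (R, D)

    boxes = 1.0 / (1.0 + np.exp(-(pooled @ W_box)))  # sigmoid -> (0, 1)
    class_logits = pooled @ W_cls                    # (R, num_classes)
    return boxes, class_logits

# 100 frames of 16-dim features, e.g. one short utterance
feats = rng.standard_normal((100, 16))
boxes, cls = region_wise_head(feats)
print(boxes.shape, cls.shape)  # (8, 2) (8, 5)
```

In a trained system the per-region boundaries would be decoded back to time stamps (region offset plus predicted start/width times the region span), which is what gives the time-accurate detections the abstract describes.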