Automatic Disfluency Detection From Untranscribed Speech

Impact Factor: 4.1 · CAS Region 2 (Computer Science) · JCR Q1 (Acoustics)
Amrit Romana;Kazuhito Koishida;Emily Mower Provost
DOI: 10.1109/TASLP.2024.3485465
Journal: IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 4727-4740
Published: 2024-10-23 (Journal Article)
URL: https://ieeexplore.ieee.org/document/10731569/
Citations: 0

Abstract

Speech disfluencies, such as filled pauses or repetitions, are disruptions in the typical flow of speech. All speakers experience disfluencies at times, and the rate at which we produce disfluencies may be increased by certain speaker or environmental characteristics. Modeling disfluencies has been shown to be useful for a range of downstream tasks, and as a result, disfluency detection has many potential applications. In this work, we investigate language, acoustic, and multimodal methods for frame-level automatic disfluency detection and categorization. Each of these methods relies on audio as input. First, we evaluate several automatic speech recognition (ASR) systems in terms of their ability to transcribe disfluencies, measured using disfluency error rates. We then use these ASR transcripts as input to a language-based disfluency detection model. We find that disfluency detection performance is largely limited by the quality of transcripts and alignments. We find that an acoustic-based approach that does not require transcription as an intermediate step outperforms the ASR language approach. Finally, we present multimodal architectures which we find improve disfluency detection performance over the unimodal approaches. Ultimately, this work introduces novel approaches for automatic frame-level disfluency detection and categorization. In the long term, this will help researchers incorporate automatic disfluency detection into a range of applications.
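The abstract frames detection at the frame level, which implies mapping time-stamped disfluency annotations onto a fixed-rate frame grid before training or scoring. The sketch below illustrates that preprocessing step under illustrative assumptions: the 20 ms frame hop, the label names, and the function itself are hypothetical and not taken from the paper.

```python
# Hypothetical sketch: converting word-level disfluency annotations
# (start/end times in seconds) into per-frame labels, the kind of target
# sequence a frame-level disfluency detector would be trained on.
# The 20 ms hop and label set are illustrative assumptions.

def words_to_frame_labels(annotations, duration_s, frame_hop_s=0.02):
    """Map (start_s, end_s, label) annotations onto fixed-rate frames.

    Spans not covered by any annotation are labeled "fluent".
    """
    n_frames = int(round(duration_s / frame_hop_s))
    labels = ["fluent"] * n_frames
    for start_s, end_s, label in annotations:
        first = int(round(start_s / frame_hop_s))
        last = min(n_frames, int(round(end_s / frame_hop_s)))
        for i in range(first, last):
            labels[i] = label
    return labels

# Example: a filled pause from 0.50 s to 0.70 s in a 1.0 s clip
frames = words_to_frame_labels([(0.5, 0.7, "filled_pause")], 1.0)
print(frames.count("filled_pause"))  # 10 frames at a 20 ms hop
```

With labels at this granularity, frame-level detection can be scored with standard per-frame classification metrics, which is consistent with the abstract's comparison of language, acoustic, and multimodal approaches on a common footing.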
Source journal: IEEE/ACM Transactions on Audio, Speech, and Language Processing (Acoustics; Engineering, Electrical & Electronic)
CiteScore: 11.30
Self-citation rate: 11.10%
Articles published per year: 217
Journal description: The IEEE/ACM Transactions on Audio, Speech, and Language Processing covers audio, speech, and language processing and the sciences that support them. In audio processing: transducers, room acoustics, active sound control, human audition, analysis/synthesis/coding of music, and consumer audio. In speech processing: areas such as speech analysis, synthesis, coding, speech and speaker recognition, speech production and perception, and speech enhancement. In language processing: speech and text analysis, understanding, generation, dialog management, translation, summarization, question answering, and document indexing and retrieval, as well as general language modeling.