Speech Disfluency Detection with Contextual Representation and Data Distillation

Payal Mohapatra, Akash Pandey, Bashima Islam, Qi Zhu
{"title":"Speech Disfluency Detection with Contextual Representation and Data Distillation","authors":"Payal Mohapatra, Akash Pandey, Bashima Islam, Qi Zhu","doi":"10.1145/3539490.3539601","DOIUrl":null,"url":null,"abstract":"Stuttering affects almost 1\\% of the world's population. It has a deep sociological impact and hinders the people who stutter from taking advantage of voice-assisted services. Automatic stutter detection based on deep learning can help voice assistants to adapt themselves to atypical speech. However, disfluency data is very limited and expensive to generate. We propose a set of preprocessing techniques: (1) using data with high inter-annotator agreement, (2) balancing different classes, and (3) using contextual embeddings from a pretrained network. We then design a disfluency classification network (DisfluencyNet) for automated speech disfluency detection that takes these contextual embeddings as an input. We empirically demonstrate high performance using only a quarter of the data for training. We conduct experiments with different training data size, evaluate the model trained on the lowest amount of training data with SEP-28k baseline results, and evaluate the same model on the FluencyBank dataset baseline results. We observe that, even by using a quarter of the original size of the dataset, our F1 score is greater than 0.7 for all types of disfluencies except one,\\textit{ blocks}. Previous works also reported lower performance with \\textit{blocks} type of disfluency owing to its large diversity amongst speakers and events. Overall, with our approach using only a few minutes of data, we can train a robust network that outperforms the baseline results for all disfluencies by at least 5\\%. Such a result is important to stress the fact that we can now reduce the required amount of training data and are able to improve the quality of the dataset by appointing more than two annotators for labeling speech disfluency within a constrained labeling budget.","PeriodicalId":377149,"journal":{"name":"Proceedings of the 1st ACM International Workshop on Intelligent Acoustic Systems and Applications","volume":"101 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 1st ACM International Workshop on Intelligent Acoustic Systems and Applications","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3539490.3539601","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 5

Abstract

Stuttering affects almost 1% of the world's population. It has a deep sociological impact and hinders people who stutter from taking advantage of voice-assisted services. Automatic stutter detection based on deep learning can help voice assistants adapt to atypical speech. However, disfluency data are very limited and expensive to generate. We propose a set of preprocessing techniques: (1) using data with high inter-annotator agreement, (2) balancing the different classes, and (3) using contextual embeddings from a pretrained network. We then design a disfluency classification network (DisfluencyNet) for automated speech disfluency detection that takes these contextual embeddings as input. We empirically demonstrate high performance using only a quarter of the data for training. We conduct experiments with different training data sizes, compare the model trained on the smallest amount of training data against the SEP-28k baseline results, and evaluate the same model against the FluencyBank baseline results. We observe that, even when using a quarter of the original dataset, our F1 score is greater than 0.7 for all types of disfluency except one, "blocks". Previous works also reported lower performance on the blocks type of disfluency owing to its large diversity across speakers and events. Overall, with our approach using only a few minutes of data, we can train a robust network that outperforms the baseline results for all disfluencies by at least 5%. This result underscores that we can now reduce the required amount of training data, and that we can improve dataset quality by appointing more than two annotators to label speech disfluency within a constrained labeling budget.
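To make the pipeline above concrete, the following is a minimal sketch of the three preprocessing steps, assuming a SEP-28k-style annotation table (one row per audio clip, one label column per annotator) and a wav2vec 2.0 model as the pretrained contextual encoder. The column names, the unanimity rule for inter-annotator agreement, the downsampling strategy, and the mean-pooling step are illustrative assumptions, not details taken from the paper.

```python
# Illustrative sketch only: column names, the agreement rule, and the pooling
# choice are assumptions, not specifics from the paper.
import pandas as pd
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

ANNOTATOR_COLS = ["ann_1", "ann_2", "ann_3"]  # hypothetical per-annotator labels


def keep_high_agreement(df: pd.DataFrame) -> pd.DataFrame:
    """(1) Keep only clips on which every annotator assigned the same label."""
    unanimous = df[ANNOTATOR_COLS].nunique(axis=1) == 1
    return df[unanimous].assign(label=df[ANNOTATOR_COLS[0]])


def balance_classes(df: pd.DataFrame) -> pd.DataFrame:
    """(2) Downsample every class to the size of the rarest class."""
    n = df["label"].value_counts().min()
    return (
        df.groupby("label", group_keys=False)
        .apply(lambda g: g.sample(n=n, random_state=0))
        .reset_index(drop=True)
    )


extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()


@torch.no_grad()
def contextual_embedding(waveform: torch.Tensor, sr: int = 16_000) -> torch.Tensor:
    """(3) Mean-pool frame-level wav2vec 2.0 states into one clip-level vector."""
    inputs = extractor(waveform.numpy(), sampling_rate=sr, return_tensors="pt")
    hidden = encoder(inputs.input_values).last_hidden_state  # (1, frames, 768)
    return hidden.mean(dim=1).squeeze(0)  # (768,) vector fed to the classifier
```

Because the pretrained encoder is used frozen in this sketch, the clip-level vectors can be computed once and cached, so only the small downstream classifier needs training, which is one plausible reason a model trained on a quarter of the data remains competitive.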