A Dynamic, Self Supervised, Large Scale AudioVisual Dataset for Stuttered Speech

Proceedings of the 1st International Workshop on Multimodal Conversational AI. Pub Date: 2020-10-16. DOI: 10.1145/3423325.3423733

Mehmet Altinkaya, A. Smeulders

Citations: 2

Abstract

Stuttering affects at least 1% of the world population. It is caused by irregular disruptions in speech production, which occur in various forms and frequencies. The most common are repetitions of words or parts of words, prolongations, and blocks in getting words out. Accurate detection and classification of stuttering would be important for assessing severity in speech therapy. Furthermore, real-time detection might create many new possibilities for reconstructing stuttered speech into fluent speech. Such an interface could help people use voice-based interfaces such as Apple Siri and Google Assistant, or make (video) phone calls more fluent through delayed delivery. In this paper we present the first expandable audio-visual database of stuttered speech. We explore an end-to-end, real-time, multi-modal model for detecting and classifying stuttered blocks in unbound speech. We also make use of video signals, since acoustic signals cannot be produced immediately. We use multiple modalities because acoustic signals, together with the secondary characteristics exhibited in visual signals, permit increased detection accuracy.
