Latest Interspeech Publications

Audio-Visual Speech Recognition in MISP2021 Challenge: Dataset Release and Deep Analysis
Interspeech Pub Date: 2022-09-18 DOI: 10.21437/interspeech.2022-10483
Hang Chen, Jun Du, Yusheng Dai, Chin-Hui Lee, S. Siniscalchi, Shinji Watanabe, O. Scharenborg, Jingdong Chen, Baocai Yin, Jia Pan
Abstract: In this paper, we present the updated Audio-Visual Speech Recognition (AVSR) corpus of the MISP2021 challenge, a large-scale audio-visual Chinese conversational corpus consisting of 141 hours of audio and video data collected by far/middle/near microphones and far/middle cameras in 34 real-home TV rooms. To the best of our knowledge, our corpus is the first distant multi-microphone conversational Chinese audio-visual corpus and the first large-vocabulary continuous Chinese lip-reading dataset in the adverse home-TV scenario. Moreover, we conduct a deep analysis of the corpus and a comprehensive ablation study of all audio and video data in audio-only/video-only/audio-visual systems. Error analysis shows that the video modality supplements acoustic information degraded by noise to reduce deletion errors and provides discriminative information in overlapping speech to reduce substitution errors. Finally, we design a set of experiments on front-ends, data augmentation, and end-to-end models to provide directions for potential future work. The corpus and the code are released to promote research not only in the speech area but also in computer vision and cross-disciplinary research.
Pages: 1766-1770
Citations: 12
Data Augmentation for End-to-end Silent Speech Recognition for Laryngectomees
Interspeech Pub Date: 2022-09-18 DOI: 10.21437/interspeech.2022-10868
Beiming Cao, Kristin J. Teplansky, Nordine Sebkhi, Arpan Bhavsar, O. Inan, Robin A. Samlan, T. Mau, Jun Wang
Abstract: Silent speech recognition (SSR) predicts textual information from silent articulation and is a core algorithmic component of silent speech interfaces (SSIs). SSIs have the potential of recovering the speech ability of individuals who have lost their voice but can still articulate (e.g., laryngectomees). Due to the logistic difficulties of articulatory data collection, current SSR studies suffer from limited amounts of data. Data augmentation aims to increase the amount of training data by introducing variations into the existing dataset, but it has rarely been investigated in SSR for laryngectomees. In this study, we investigated the effectiveness of multiple data augmentation approaches for SSR, including consecutive and intermittent time masking, articulatory dimension masking, sinusoidal noise injection, and random scaling. Different experimental setups, including speaker-dependent, speaker-independent, and speaker-adaptive, were used. The SSR models were end-to-end speech recognition models trained with connectionist temporal classification (CTC). Electromagnetic articulography (EMA) datasets collected from multiple healthy speakers and laryngectomees were used. The experimental results demonstrated that the data augmentation approaches explored performed differently but generally improved SSR performance. In particular, consecutive time masking brought significant improvements in SSR for both healthy speakers and laryngectomees.
Pages: 3653-3657
Citations: 4
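
As a rough illustration of the consecutive time masking that the study above found most effective, the short Python sketch below zeroes out random contiguous spans of frames in an articulatory feature sequence. The function name, mask parameters, and the dummy EMA array are illustrative assumptions, not code from the paper.

```python
import numpy as np

def consecutive_time_mask(features, max_mask_frames=30, num_masks=2, seed=None):
    """Zero out `num_masks` random contiguous spans of frames (rows) in a
    (frames x dimensions) articulatory feature matrix."""
    rng = np.random.default_rng(seed)
    augmented = features.copy()
    num_frames = augmented.shape[0]
    for _ in range(num_masks):
        span = int(rng.integers(1, max_mask_frames + 1))
        if span >= num_frames:
            continue
        start = int(rng.integers(0, num_frames - span))
        augmented[start:start + span, :] = 0.0  # mask a consecutive block of frames
    return augmented

# Example: a dummy sequence of 300 EMA frames with 12 articulatory dimensions.
ema = np.random.randn(300, 12)
ema_augmented = consecutive_time_mask(ema, max_mask_frames=30, num_masks=2, seed=0)
```

Intermittent time masking or articulatory dimension masking can be obtained in the same spirit by masking scattered individual frames or feature columns instead of contiguous frame blocks.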
Improving Spoken Language Understanding with Cross-Modal Contrastive Learning
Interspeech Pub Date: 2022-09-18 DOI: 10.21437/interspeech.2022-658
Jingjing Dong, Jiayi Fu, P. Zhou, Hao Li, Xiaorui Wang
Abstract: Spoken language understanding (SLU) is conventionally based on a pipeline architecture with error propagation issues. To mitigate this problem, end-to-end (E2E) models have been proposed to directly map speech input to the desired semantic outputs. Meanwhile, others try to leverage linguistic information in addition to acoustic information by adopting a multi-modal architecture. In this work, we propose a novel multi-modal SLU method, named CMCL, which utilizes cross-modal contrastive learning to learn better multi-modal representations. In particular, a two-stream multi-modal framework is designed, and a contrastive learning task is performed across speech and text representations. Moreover, CMCL employs a multi-modal shared classification task combined with a contrastive learning task to guide the learned representation and improve performance on the intent classification task. We also investigate the efficacy of employing cross-modal contrastive learning during pretraining. CMCL achieves 99.69% and 92.50% accuracy on the FSC and Smartlights datasets, respectively, outperforming state-of-the-art comparative methods. Also, performance decreases by only 0.32% and 2.8%, respectively, when trained on 10% and 1% of the FSC dataset, indicating its advantage in few-shot scenarios.
Pages: 2693-2697
Citations: 2
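
To make the cross-modal contrastive objective concrete, here is a minimal PyTorch sketch of a symmetric InfoNCE-style loss between paired speech and text embeddings; the embedding size, temperature, and symmetric formulation are assumptions for illustration, not necessarily the exact loss used in CMCL.

```python
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(speech_emb, text_emb, temperature=0.07):
    """Paired rows of speech_emb and text_emb are positives; all other
    in-batch combinations act as negatives."""
    speech = F.normalize(speech_emb, dim=-1)
    text = F.normalize(text_emb, dim=-1)
    logits = speech @ text.t() / temperature           # (batch, batch) similarities
    targets = torch.arange(speech.size(0), device=speech.device)
    loss_s2t = F.cross_entropy(logits, targets)        # speech -> text direction
    loss_t2s = F.cross_entropy(logits.t(), targets)    # text -> speech direction
    return 0.5 * (loss_s2t + loss_t2s)

# Example with a batch of 8 paired utterances and 256-dimensional embeddings.
loss = cross_modal_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
```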
Speaker- and Phone-aware Convolutional Transformer Network for Acoustic Echo Cancellation
Interspeech Pub Date: 2022-09-18 DOI: 10.21437/interspeech.2022-10077
Chang Han, Weiping Tu, Yuhong Yang, Jingyi Li, Xinhong Li
Abstract: Recent studies indicate the effectiveness of deep learning (DL) based methods for acoustic echo cancellation (AEC) in background noise and nonlinear distortion scenarios. However, content and speaker variations degrade the performance of such DL-based AEC models. In this study, we propose an AEC model that takes phonetic and speaker identity features as auxiliary inputs and present a complex dual-path convolutional transformer network (DPCTNet). Given an input signal, the phonetic and speaker identity features extracted by a contrastive predictive coding network (a self-supervised pre-training model) and the complex spectrum generated by the short-time Fourier transform are treated as the spectrum pattern inputs for DPCTNet. In addition, DPCTNet applies an encoder-decoder architecture improved by inserting a dual-path transformer to effectively model the extracted inputs within a single frame and the dependencies between consecutive frames. Comparative experimental results showed that AEC performance can be improved by explicitly considering phonetic and speaker identity features.
Pages: 2513-2517
Citations: 1
Convolutional Neural Networks for Classification of Voice Qualities from Speech and Neck Surface Accelerometer Signals
Interspeech Pub Date: 2022-09-18 DOI: 10.21437/interspeech.2022-10513
Sudarsana Reddy Kadiri, F. Javanmardi, P. Alku
Abstract: Prior studies in the automatic classification of voice quality have mainly studied support vector machine (SVM) classifiers using the acoustic speech signal as input. Recently, one voice quality classification study was published using neck surface accelerometer (NSA) and speech signals as inputs and using SVMs with hand-crafted glottal source features. The present study examines simultaneously recorded NSA and speech signals in the classification of three voice qualities (breathy, modal, and pressed) using convolutional neural networks (CNNs) as classifiers. The study has two goals: (1) to investigate which of the two signals (NSA vs. speech) is more useful in the classification task, and (2) to compare whether deep learning-based CNN classifiers with spectrogram and mel-spectrogram features are able to improve the classification accuracy compared to SVM classifiers using hand-crafted glottal source features. The results indicated that the NSA signal showed better classification of the voice qualities compared to the speech signal, and that the CNN classifier outperformed the SVM classifiers by large margins. The best mean classification accuracy was achieved with the mel-spectrogram as input to the CNN classifier (93.8% for NSA and 90.6% for speech).
Pages: 5253-5257
Citations: 2
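
The pipeline that performed best above, a mel-spectrogram fed to a CNN with three output classes (breathy, modal, pressed), can be sketched roughly as follows in PyTorch/torchaudio; the layer sizes, sample rate, and number of mel bands are placeholder assumptions rather than the authors' architecture.

```python
import torch
import torch.nn as nn
import torchaudio

class VoiceQualityCNN(nn.Module):
    """Small CNN that maps a mel-spectrogram to three voice-quality classes."""
    def __init__(self, num_classes=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),           # global pooling over frequency and time
        )
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, mel):                    # mel: (batch, 1, n_mels, frames)
        return self.classifier(self.features(mel).flatten(1))

mel_transform = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=80)
waveform = torch.randn(1, 16000)               # one second of a dummy NSA or speech signal
logits = VoiceQualityCNN()(mel_transform(waveform).unsqueeze(0))
```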
Recurrent multi-head attention fusion network for combining audio and text for speech emotion recognition
Interspeech Pub Date: 2022-09-18 DOI: 10.21437/interspeech.2022-888
C. Ahn, Chamara Kasun, S. Sivadas, Jagath Rajapakse
Pages: 744-748
Citations: 1
Use of Nods Less Synchronized with Turn-Taking and Prosody During Conversations in Adults with Autism
Interspeech Pub Date: 2022-09-18 DOI: 10.21437/interspeech.2022-11388
K. Ochi, Nobutaka Ono, Keiho Owada, Kuroda Miho, S. Sagayama, H. Yamasue
Abstract: Autism spectrum disorder (ASD) is a highly prevalent neurodevelopmental disorder characterized by deficits in communication and social interaction. Head nodding, a kind of visual backchannel, is used to co-construct conversation and is crucial to smooth social interaction. In the present study, we quantitatively analyze how head nodding relates to speech turn-taking and prosodic change in Japanese conversation. The results showed that nodding was less frequently observed in ASD participants, especially around speakers' turn transitions, whereas it was notable just before and after turn-taking in individuals with typical development (TD). Analysis using 16-second sliding segments revealed that synchronization between nod frequency and mean vocal intensity was higher in the TD group than in the ASD group. Classification by a support vector machine (SVM) using these proposed features achieved high performance, with an accuracy of 91.1% and an F-measure of 0.942. In addition, the results indicated an optimal way of nodding according to turn-ending and emphasis, which could provide standard responses for reference or feedback in social skill training for people with ASD. Furthermore, the natural timing of nodding implied by the results can also be applied to developing interactive responses in humanoid robots or computer graphics (CG) agents.
Pages: 1136-1140
Citations: 0
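
As a rough sketch of the SVM classification step reported above, the scikit-learn example below cross-validates an RBF-kernel SVM on a small synthetic feature matrix. The three feature columns (e.g., nod frequency, nod-turn synchrony, nod-intensity synchrony) and the labels are placeholders, not the study's data or exact feature set.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))           # placeholder nod/prosody synchrony features per participant
y = rng.integers(0, 2, size=60)        # placeholder labels: 0 = TD, 1 = ASD

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
scores = cross_val_score(clf, X, y, cv=5)
print(f"mean cross-validated accuracy: {scores.mean():.3f}")
```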
Weakly-Supervised Neural Full-Rank Spatial Covariance Analysis for a Front-End System of Distant Speech Recognition
Interspeech Pub Date: 2022-09-18 DOI: 10.21437/interspeech.2022-11077
Yoshiaki Bando, T. Aizawa, Katsutoshi Itoyama, K. Nakadai
Abstract: This paper presents a weakly-supervised multichannel neural speech separation method for distant speech recognition (DSR) of real conversational speech mixtures. A blind source separation (BSS) method called neural full-rank spatial covariance analysis (FCA) can precisely separate multichannel speech mixtures by using a deep spectral model without any supervision. The neural FCA, however, requires that the number of sound sources be fixed and known in advance. This requirement complicates its use in a front-end system of DSR for multi-speaker conversations, in which the number of speakers changes dynamically. In this paper, we propose an extension of neural FCA that handles a dynamically changing number of sound sources by taking the temporal voice activities of target speakers as auxiliary information. We train a source separation network in a weakly-supervised manner using a dataset of multichannel audio mixtures and their voice activities. Experimental results on the CHiME-6 dataset, whose task is to recognize conversations at dinner parties, show that our method outperformed a conventional BSS-based system in word error rate.
Pages: 3824-3828
Citations: 4
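
The statistic at the heart of FCA-style separation is a full-rank spatial covariance matrix per source and frequency bin. The simplified NumPy sketch below estimates one such matrix from multichannel STFT frames weighted by a target speaker's activity mask; it illustrates only the weighted-covariance idea behind the method, not the paper's neural model or its weakly-supervised training.

```python
import numpy as np

def weighted_spatial_covariance(stft_bin, activity):
    """stft_bin: (channels, frames) complex STFT values at one frequency bin.
    activity: (frames,) weights in [0, 1], e.g. a target voice-activity mask.
    Returns a (channels, channels) Hermitian covariance estimate."""
    weighted = stft_bin * activity[np.newaxis, :]
    return (weighted @ stft_bin.conj().T) / max(activity.sum(), 1e-8)

channels, frames = 4, 200
stft_bin = np.random.randn(channels, frames) + 1j * np.random.randn(channels, frames)
vad = (np.random.rand(frames) > 0.5).astype(float)   # dummy voice-activity labels
R = weighted_spatial_covariance(stft_bin, vad)
```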
Native phonotactic interference in L2 vowel processing: Mouse-tracking reveals cognitive conflicts during identification
Interspeech Pub Date: 2022-09-18 DOI: 10.21437/interspeech.2022-12
Yizhou Wang, R. Bundgaard-Nielsen, B. Baker, Olga Maxwell
Abstract: Regularities of phoneme distribution in a listener's native language (L1), i.e., L1 phonotactics, can at times induce interference in their perception of second language (L2) phonemes and phonemic strings. This paper presents a study examining the phonological interference experienced by L1 Mandarin listeners in identifying the English /i/ vowel in three consonantal contexts /p, f, w/, which have different distributional patterns in Mandarin phonology: /pi/ is a licit sequence in Mandarin, */fi/ is illicit due to co-occurrence restrictions, and */wi/ is illicit due to Mandarin contextual allophony. L1 Mandarin listeners completed two versions of an identification experiment (keystroke and mouse-tracking), in which they identified vowels in different consonantal contexts. Analysis of error rates, response times, and hand motions in the tasks suggests that L1 co-occurrence restrictions and contextual allophony induce different levels of phonological interference in L2 vowel perception compared to the licit control condition. In support of the dynamic theory of linguistic cognition, our results indicate that illicit phonotactic contexts can lead to more identification errors, longer decision processes, and spurious activation of a distractor category.
Pages: 5223-5227
Citations: 0
Predicting VQVAE-based Character Acting Style from Quotation-Annotated Text for Audiobook Speech Synthesis
Interspeech Pub Date: 2022-09-18 DOI: 10.21437/interspeech.2022-638
Wataru Nakata, Tomoki Koriyama, Shinnosuke Takamichi, Yuki Saito, Yusuke Ijima, Ryo Masumura, H. Saruwatari
Abstract: We propose a speech-synthesis model for predicting appropriate voice styles on the basis of character-annotated text for audiobook speech synthesis. An audiobook is more engaging when the narrator uses distinctive voices for the story characters. Our goal is to produce such distinctive voices in the speech-synthesis framework; however, this distinction has not been extensively investigated in audiobook speech synthesis. To enable the speech-synthesis model to achieve distinctive voices for characters with minimal extra annotation, we propose a model that predicts character-appropriate voices from quotation-annotated text. Our proposed model involves character-acting-style extraction based on a vector quantized variational autoencoder and style prediction from quotation-annotated text, which enables us to automate audiobook creation with character-distinctive voices. To the best of our knowledge, this is the first attempt to model intra-speaker voice style depending on character acting for audiobook speech synthesis. We conducted subjective evaluations of our model, and the results indicate that the proposed model generated more distinctive character voices than models that do not use the explicit character acting style, while maintaining the naturalness of synthetic speech.
Pages: 4551-4555
Citations: 3
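
A minimal sketch of the vector-quantization step underlying a VQ-VAE-based style extractor such as the one described above: each utterance-level style latent is snapped to its nearest codebook entry, with a straight-through estimator so gradients still reach the encoder. The codebook size, latent dimension, and batch size are illustrative assumptions.

```python
import torch

def vector_quantize(latents, codebook):
    """latents: (batch, dim) encoder outputs; codebook: (num_codes, dim)."""
    distances = torch.cdist(latents, codebook)        # (batch, num_codes) L2 distances
    indices = distances.argmin(dim=-1)                # nearest code per latent
    quantized = codebook[indices]
    # Straight-through estimator: forward uses the code, backward passes gradients to latents.
    return latents + (quantized - latents).detach(), indices

codebook = torch.randn(64, 128)        # 64 acting-style codes, 128-dimensional
style_latents = torch.randn(8, 128)    # 8 utterance-level latents from an encoder
quantized, codes = vector_quantize(style_latents, codebook)
```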