Latest Interspeech Publications

ASR-Robust Natural Language Understanding on ASR-GLUE dataset
Interspeech Pub Date: 2022-09-18 DOI: 10.21437/interspeech.2022-10097
Lingyun Feng, Jianwei Yu, Yan Wang, Songxiang Liu, Deng Cai, Haitao Zheng
Citations: 0
A Complementary Joint Training Approach Using Unpaired Speech and Text
Interspeech Pub Date: 2022-09-18 DOI: 10.21437/interspeech.2022-291
Ye Du, J. Zhang, Qiu-shi Zhu, Lirong Dai, Ming Wu, Xin Fang, Zhouwang Yang
Citations: 0
Effects of Noise on Speech Perception and Spoken Word Comprehension
Interspeech Pub Date: 2022-09-18 DOI: 10.21437/interspeech.2022-10543
Jovan Eranovic, D. Pape, M. Stroińska, E. Service, Marijana Matkovski
Abstract: The aim of the study was to find out which of the three categories of noise acting as maskers (energetic: masking portions of the target speech with its energy; informational: both target and masker compete for the listener's attention; degraded: reverberated or filtered speech) is most detrimental to speech perception and spoken word comprehension. To that end, participants completed three tasks with and without added noise – listening span, listening comprehension, and shadowing – where shadowing is considered primarily a task relying on speech perception, with the other two tasks considered to rely on word comprehension and semantic inference. The study found informational masking to be most detrimental to speech perception, while energetic masking and sound degradation were most detrimental to spoken word comprehension. The results also imply that masking categories must be used with caution, since not all maskers belonging to one category had the same effect on performance.
Citations: 0
State & Trait Measurement from Nonverbal Vocalizations: A Multi-Task Joint Learning Approach
Interspeech Pub Date: 2022-09-18 DOI: 10.21437/interspeech.2022-10927
Alice Baird, Panagiotis Tzirakis, Jeff Brooks, Lauren Kim, Michael Opara, Christopher B. Gregory, Jacob Metrick, Garrett Boseck, D. Keltner, Alan S. Cowen
Abstract: Humans infer a wide array of meanings from expressive nonverbal vocalizations, e.g., laughs, cries, and sighs. Thus far, computational research has primarily focused on the coarse classification of vocalizations such as laughs, but that approach overlooks significant variations in the meaning of distinct laughs (e.g., amusement, awkwardness, triumph) and the rich array of more nuanced vocalizations people form. Nonverbal vocalizations are shaped by the emotional state an individual chooses to convey, their wellbeing, and (as with the voice more broadly) their identity-related traits. In the present study, we utilize a large-scale dataset comprising more than 35 hours of densely labeled vocal bursts to model emotionally expressive states and demographic traits from nonverbal vocalizations. We compare a single-task and multi-task deep learning architecture to explore how models can leverage acoustic co-dependencies that may exist between the expression of 10 emotions by vocal bursts and the demographic traits of the speaker. Results show that nonverbal vocalizations can be reliably leveraged to predict emotional expression, age, and country of origin. In a multi-task setting, our experiments show that joint learning of emotional expression and demographic traits appears to yield robust results, primarily beneficial for the classification of a speaker's country of origin.
Citations: 1
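The multi-task setup described in the abstract above (a shared acoustic encoder over vocal bursts with separate heads for emotional expression, age, and country of origin) can be pictured with a minimal PyTorch sketch. This is an illustration only, not the authors' implementation: the encoder type, layer sizes, feature dimension, and loss choices are assumptions made for the example.

import torch
import torch.nn as nn

class MultiTaskVocalBurstModel(nn.Module):
    """Shared encoder with task-specific heads (hypothetical sizes and heads)."""
    def __init__(self, feat_dim=128, hidden=256, n_emotions=10, n_countries=4):
        super().__init__()
        # Shared encoder over frame-level acoustic features (e.g., log-mel frames)
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.emotion_head = nn.Linear(2 * hidden, n_emotions)   # emotion expression intensities
        self.age_head = nn.Linear(2 * hidden, 1)                # scalar age regression
        self.country_head = nn.Linear(2 * hidden, n_countries)  # country-of-origin classification

    def forward(self, x):                                        # x: (batch, frames, feat_dim)
        _, h = self.encoder(x)
        h = torch.cat([h[-2], h[-1]], dim=-1)                    # concat last forward/backward states
        return self.emotion_head(h), self.age_head(h), self.country_head(h)

def joint_loss(outputs, targets):
    # Joint objective: regression on emotion intensities and age, cross-entropy on country
    emo, age, country = outputs
    emo_t, age_t, country_t = targets
    return (nn.functional.mse_loss(emo, emo_t)
            + nn.functional.mse_loss(age.squeeze(-1), age_t)
            + nn.functional.cross_entropy(country, country_t))

Summing the per-task losses is the simplest joint-training choice; the paper's actual weighting and head design may differ.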
Analysis of expressivity transfer in non-autoregressive end-to-end multispeaker TTS systems
Interspeech Pub Date: 2022-09-18 DOI: 10.21437/interspeech.2022-10388
Ajinkya Kulkarni, Vincent Colotte, D. Jouvet
Abstract: The main objective of this work is to study the expressivity transfer in a speaker's voice for which no expressive speech data is available in non-autoregressive end-to-end TTS systems. We investigated the expressivity transfer capability of probability density estimation based on deep generative models, namely Generative Flow (Glow) and diffusion probabilistic models (DPM). The usage of deep generative models provides better log likelihood estimates and tractability of the system, subsequently providing high-quality speech synthesis with faster inference speed. Furthermore, we propose the usage of various expressivity encoders, which assist in expressivity transfer in the text-to-speech (TTS) system. More precisely, we used self-attention statistical pooling and multi-scale expressivity encoder architectures for creating a meaningful representation of expressivity. In addition to traditional subjective metrics used for speech synthesis evaluation, we incorporated cosine-similarity to measure the strength of attributes associated with speaker and expressivity. The performance of a non-autoregressive TTS system with a multi-scale expressivity encoder showed better expressivity transfer on Glow and DPM-based decoders. This illustrates the ability of the multi-scale architecture to apprehend the underlying attributes of expressivity from multiple acoustic features.
Citations: 1
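The cosine-similarity evaluation mentioned in the abstract above reduces to comparing attribute embeddings (speaker or expressivity) extracted from synthesized speech against those of a reference. A minimal sketch under assumed inputs follows; the embedding extractor is a placeholder, not a component named by the paper.

import numpy as np

def attribute_similarity(synth_embedding, reference_embedding):
    """Cosine similarity between attribute embeddings of synthetic and reference speech."""
    a = np.asarray(synth_embedding, dtype=float)
    b = np.asarray(reference_embedding, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Usage sketch (extract() is a hypothetical speaker/expressivity embedding extractor):
# a higher similarity to the expressive reference than to a neutral reference
# would indicate that expressivity was transferred.
# sim_expressive = attribute_similarity(extract(synth_wav), extract(expressive_ref_wav))
# sim_neutral = attribute_similarity(extract(synth_wav), extract(neutral_ref_wav))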
A blueprint for using deepfakes in sociolinguistic matched-guise experiments
Interspeech Pub Date: 2022-09-18 DOI: 10.21437/interspeech.2022-10782
N. Young, D. Britain, A. Leemann
Abstract: Matched-guise paradigms, which are used extensively in speaker and accent evaluation studies, have long been hampered by empirical holes. We offer a solution by incorporating deepfake technology, which greatly reduces the number of potential confounds. We constructed a sociophonetic experiment whereby high-rising terminal (a.k.a. "uptalk"), and the lack thereof, was superimposed onto a deepfaked "beautiful" and "less beautiful" female guise. The resulting four guises were incorporated into a 2x2-factor between-subjects experiment tested on female evaluators. Each evaluator assessed their respective guise against a list of prescribed attributes and offered free-form comments. Results align with studies on high-rising terminal as well as intuitions concerning conventional beauty, which validates the technique and motivates its wider adoption.
Citations: 0
End-to-End Audio-Visual Neural Speaker Diarization
Interspeech Pub Date: 2022-09-18 DOI: 10.21437/interspeech.2022-10106
Maokui He, Jun Du, Chin-Hui Lee
Abstract: In this paper, we propose a novel end-to-end neural-network-based audio-visual speaker diarization method. Unlike most existing audio-visual methods, our audio-visual model takes audio features (e.g., FBANKs), multi-speaker lip regions of interest (ROIs), and multi-speaker i-vector embeddings as multi-modal inputs, and a set of binary classification output layers produces the activities of each speaker. With the finely designed end-to-end structure, the proposed method can explicitly handle overlapping speech and accurately distinguish between speech and non-speech with multi-modal information. I-vectors are the key to solving the alignment problem caused by visual modality errors (e.g., occlusions, off-screen speakers, or unreliable detection). Besides, our audio-visual model is robust to the absence of the visual modality, whereas the diarization performance degrades significantly using the visual-only model. Evaluated on the datasets of the first Multimodal Information based Speech Processing (MISP) challenge, the proposed method achieved diarization error rates (DERs) of 10.1%/9.5% on the development/eval set with reference voice activity detection (VAD) information, while the audio-only and video-only systems yielded DERs of 27.9%/29.0% and 14.6%/13.1%, respectively.
Citations: 9
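The output formulation in the abstract above, per-speaker binary activity decisions over fused audio, lip-ROI, and i-vector inputs, can be sketched schematically. This is not the authors' architecture: the fusion by concatenation, the LSTM backbone, and all dimensions are assumptions made for illustration.

import torch
import torch.nn as nn

class AudioVisualDiarizer(nn.Module):
    """Per-speaker, per-frame speech-activity outputs from fused audio-visual features (hypothetical)."""
    def __init__(self, audio_dim=80, visual_dim=256, ivec_dim=100, hidden=256, max_speakers=4):
        super().__init__()
        fused_dim = audio_dim + max_speakers * (visual_dim + ivec_dim)
        self.backbone = nn.LSTM(fused_dim, hidden, num_layers=2,
                                batch_first=True, bidirectional=True)
        # One sigmoid output per speaker and frame: speaking / not speaking
        self.activity = nn.Linear(2 * hidden, max_speakers)

    def forward(self, fbank, lip_feats, ivectors):
        # fbank: (B, T, audio_dim); lip_feats: (B, T, S, visual_dim); ivectors: (B, S, ivec_dim)
        B, T, S, _ = lip_feats.shape
        ivec = ivectors.unsqueeze(1).expand(B, T, S, -1)           # broadcast i-vectors over time
        fused = torch.cat([fbank,
                           lip_feats.reshape(B, T, -1),
                           ivec.reshape(B, T, -1)], dim=-1)
        out, _ = self.backbone(fused)
        return torch.sigmoid(self.activity(out))                   # (B, T, max_speakers)

Because each speaker gets an independent sigmoid, overlapping speech simply corresponds to several outputs being active in the same frame, which matches the overlap-handling claim in the abstract.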
Speaking Rate Control of end-to-end TTS Models by Direct Manipulation of the Encoder's Output Embeddings
Interspeech Pub Date: 2022-09-18 DOI: 10.21437/interspeech.2022-759
Martin Lenglet, O. Perrotin, G. Bailly
Abstract: Since neural Text-To-Speech models have achieved such high standards in terms of naturalness, the main focus of the field has gradually shifted to gaining more control over the expressiveness of the synthetic voices. One of these levers is control of the speaking rate, which has become harder for a human operator to adjust since the introduction of neural attention networks to model speech dynamics. While numerous models have reintroduced explicit duration control (e.g., FastSpeech2), these models generally rely on additional tasks to complete during their training. In this paper, we show how an acoustic analysis of the internal embeddings delivered by the encoder of an unsupervised end-to-end TTS Tacotron2 model is enough to identify and control some acoustic parameters of interest. Specifically, we compare this speaking rate control with the duration control offered by a supervised FastSpeech2 model. Experimental results show that the control provided by embeddings reproduces a behaviour closer to natural speech data.
Citations: 2
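The control mechanism described above acts directly on the encoder's output embeddings rather than on an explicit duration predictor. As a hedged illustration of the general idea only (the direction vector, scaling convention, and model interface below are assumptions, not the paper's method), one could shift each phone-level embedding along an acoustically identified speaking-rate axis before decoding:

import torch

def adjust_speaking_rate(encoder_outputs, rate_direction, alpha):
    """Shift encoder output embeddings along an assumed speaking-rate axis.

    encoder_outputs: (T, D) phone-level embeddings from the TTS encoder
    rate_direction:  (D,) vector correlated with duration in the embedding space
    alpha:           signed strength; sign convention is an assumption of this sketch
    """
    direction = rate_direction / rate_direction.norm()
    return encoder_outputs + alpha * direction

# Usage sketch with a hypothetical Tacotron2-like interface:
# embeddings = tacotron_encoder(text)
# slowed = adjust_speaking_rate(embeddings, rate_axis, alpha=0.5)
# mel = tacotron_decoder(slowed)   # attention/decoder stack left unchanged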
Automatic Detection of Reactive Attachment Disorder Through Turn-Taking Analysis in Clinical Child-Caregiver Sessions
Interspeech Pub Date: 2022-09-18 DOI: 10.21437/interspeech.2022-387
Andréi Birladeanu, H. Minnis, A. Vinciarelli
Abstract: To the best of our knowledge, this is the first work aimed at automatic detection of Reactive Attachment Disorder, a psychiatric issue typically affecting children that experienced abuse and neglect. The proposed approach is based on the analysis of turn-taking during clinical sessions and the experiments involved 61 children and their caregivers. The results show that it is possible to detect the pathology with accuracy up to 69.2% (F1 Score 68.8%). In addition, the experiments show that the pathology tends to leave different behavioral traces in different activities. This might explain why Reactive Attachment Disorder is difficult to diagnose and tends to remain undetected. In such a context, methodologies like those proposed in this work can be a valuable support in clinical practice.
Citations: 0
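In practice, a turn-taking analysis like the one above amounts to extracting conversational statistics (turn durations, gaps, overlaps) from time-stamped child and caregiver segments and feeding them to a classifier. A rough sketch under assumed inputs (segments as (speaker, start, end) tuples; the feature set is an assumption, not the paper's):

def turn_taking_features(segments):
    """Simple turn-taking statistics from (speaker, start, end) segments."""
    if not segments:
        return {"mean_turn_dur": 0.0, "mean_gap": 0.0, "overlap_rate": 0.0}
    segments = sorted(segments, key=lambda s: s[1])
    durations, gaps, overlaps = [], [], []
    for (spk_a, s_a, e_a), (spk_b, s_b, e_b) in zip(segments, segments[1:]):
        durations.append(e_a - s_a)
        if spk_a != spk_b:                       # speaker change: either a gap or an overlap
            delta = s_b - e_a
            (gaps if delta >= 0 else overlaps).append(abs(delta))
    durations.append(segments[-1][2] - segments[-1][1])
    mean = lambda xs: sum(xs) / len(xs) if xs else 0.0
    return {"mean_turn_dur": mean(durations),
            "mean_gap": mean(gaps),
            "overlap_rate": len(overlaps) / max(len(segments) - 1, 1)}

Such features could then feed any standard classifier (e.g., logistic regression) to separate sessions of children with and without the disorder, as a rough analogue of the pipeline the abstract describes.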
Ant Multilingual Recognition System for OLR 2021 Challenge
Interspeech Pub Date: 2022-09-18 DOI: 10.21437/interspeech.2022-355
Anqi Lyu, Zhiming Wang, Huijia Zhu
Abstract: This paper presents a comprehensive description of the Ant multilingual recognition system for the 6th Oriental Language Recognition (OLR 2021) Challenge. Inspired by the transfer learning scheme, the encoder component of the language identification (LID) model is initialized from a pretrained automatic speech recognition (ASR) network to integrate lexical phonetic information into language identification. The ASR model is an encoder-decoder network based on the U2++ architecture [1]; inheriting the shared conformer encoder [2] from the pretrained ASR model, which is effective at global information capturing and local invariance modeling, the LID model, with an attentive statistical pooling layer and a following linear projection layer added on the encoder, is further finetuned until its optimum. Furthermore, data augmentation, score normalization and model ensembling are good strategies to improve performance indicators, which are investigated and analysed in detail in our paper. In the OLR 2021 Challenge, our submitted systems ranked top in both tasks 1 and 2 with primary metrics of 0.0025 and 0.0039 respectively, less than 1/3 of the second place, which fully illustrates that our methodologies for multilingual identification are effectual and competitive in real-life scenarios.
Citations: 3
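The LID head described above adds an attentive statistical pooling layer and a linear projection on top of the pretrained conformer encoder. A minimal sketch of attentive statistical pooling, a standard pooling component, is shown below; dimensions and the final projection to language logits are assumptions for illustration, not values from the paper.

import torch
import torch.nn as nn

class AttentiveStatsPooling(nn.Module):
    """Attention-weighted mean and standard deviation over encoder frames."""
    def __init__(self, feat_dim=256, attn_dim=128):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Linear(feat_dim, attn_dim), nn.Tanh(), nn.Linear(attn_dim, 1))

    def forward(self, frames):                                   # frames: (B, T, feat_dim)
        w = torch.softmax(self.attention(frames), dim=1)         # attention weights over time
        mean = torch.sum(w * frames, dim=1)                      # (B, feat_dim)
        var = torch.sum(w * (frames - mean.unsqueeze(1)) ** 2, dim=1)
        std = torch.sqrt(var.clamp(min=1e-8))
        return torch.cat([mean, std], dim=-1)                    # (B, 2 * feat_dim)

# Hypothetical LID head on top of the pretrained encoder's frame-level outputs:
# pooled = AttentiveStatsPooling(feat_dim=256)(encoder_frames)
# logits = nn.Linear(2 * 256, num_languages)(pooled)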