Interspeech. Pub Date: 2022-09-18. DOI: 10.21437/interspeech.2022-10097
"ASR-Robust Natural Language Understanding on ASR-GLUE dataset"
Lingyun Feng, Jianwei Yu, Yan Wang, Songxiang Liu, Deng Cai, Haitao Zheng
Interspeech. Pub Date: 2022-09-18. DOI: 10.21437/interspeech.2022-291
"A Complementary Joint Training Approach Using Unpaired Speech and Text"
Ye Du, J. Zhang, Qiu-shi Zhu, Lirong Dai, Ming Wu, Xin Fang, Zhouwang Yang
Interspeech. Pub Date: 2022-09-18. DOI: 10.21437/interspeech.2022-10543
"Effects of Noise on Speech Perception and Spoken Word Comprehension"
Jovan Eranovic, D. Pape, M. Stroińska, E. Service, Marijana Matkovski
Abstract: The aim of the study was to determine which of three categories of masking noise (energetic: masking portions of the target speech with its energy; informational: target and masker competing for the listener's attention; degraded: reverberated or filtered speech) is most detrimental to speech perception and spoken word comprehension. To that end, participants completed three tasks with and without added noise: listening span, listening comprehension, and shadowing. Shadowing is considered primarily a speech-perception task, while the other two are considered to rely on word comprehension and semantic inference. The study found informational masking to be most detrimental to speech perception, while energetic masking and sound degradation were most detrimental to spoken word comprehension. The results also imply that masking categories must be used with caution, since not all maskers belonging to one category had the same effect on performance.
Interspeech. Pub Date: 2022-09-18. DOI: 10.21437/interspeech.2022-10927
"State & Trait Measurement from Nonverbal Vocalizations: A Multi-Task Joint Learning Approach"
Alice Baird, Panagiotis Tzirakis, Jeff Brooks, Lauren Kim, Michael Opara, Christopher B. Gregory, Jacob Metrick, Garrett Boseck, D. Keltner, Alan S. Cowen
Abstract: Humans infer a wide array of meanings from expressive nonverbal vocalizations, e.g., laughs, cries, and sighs. Thus far, computational research has primarily focused on the coarse classification of vocalizations such as laughs, but that approach overlooks significant variations in the meaning of distinct laughs (e.g., amusement, awkwardness, triumph) and the rich array of more nuanced vocalizations people produce. Nonverbal vocalizations are shaped by the emotional state an individual chooses to convey, their wellbeing, and (as with the voice more broadly) their identity-related traits. In the present study, we utilize a large-scale dataset comprising more than 35 hours of densely labeled vocal bursts to model emotionally expressive states and demographic traits from nonverbal vocalizations. We compare single-task and multi-task deep learning architectures to explore how models can leverage acoustic co-dependencies that may exist between the expression of 10 emotions in vocal bursts and the demographic traits of the speaker. Results show that nonverbal vocalizations can be reliably leveraged to predict emotional expression, age, and country of origin. In the multi-task setting, our experiments show that joint learning of emotional expression and demographic traits appears to yield robust results, benefiting primarily the classification of a speaker's country of origin.
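As a rough illustration of the single- versus multi-task framing described in this abstract, the sketch below pushes a shared utterance embedding through separate emotion, age, and country heads. Every dimension, weight, and head choice here is an invented assumption for illustration, not the authors' architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Shared encoder output: one embedding per vocal burst (sizes are illustrative).
n_samples, emb_dim = 4, 32
shared = rng.standard_normal((n_samples, emb_dim))

# Task-specific heads on top of the shared representation.
W_emotion = rng.standard_normal((emb_dim, 10))  # 10 emotion categories
W_age     = rng.standard_normal((emb_dim, 1))   # scalar age regression
W_country = rng.standard_normal((emb_dim, 4))   # hypothetical 4 countries

emotion_pred = softmax(shared @ W_emotion)      # distribution over emotions
age_pred     = shared @ W_age                   # unbounded regression output
country_prob = softmax(shared @ W_country)      # distribution over countries

# Multi-task training would sum per-task losses, e.g.
# L = L_emotion + L_age + L_country, backpropagated through the shared encoder,
# so acoustic cues shared across tasks shape one common representation.
```
The point of the joint setup is that gradients from all three heads update the same shared encoder, which is how co-dependencies between expression and traits can be exploited.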
Interspeech. Pub Date: 2022-09-18. DOI: 10.21437/interspeech.2022-10388
"Analysis of expressivity transfer in non-autoregressive end-to-end multispeaker TTS systems"
Ajinkya Kulkarni, Vincent Colotte, D. Jouvet
Abstract: The main objective of this work is to study, in non-autoregressive end-to-end TTS systems, expressivity transfer to a speaker's voice for which no expressive speech data is available. We investigated the expressivity transfer capability of probability density estimation based on deep generative models, namely Generative Flow (Glow) and diffusion probabilistic models (DPM). Deep generative models provide better log-likelihood estimates and tractability, and consequently high-quality speech synthesis with faster inference. Furthermore, we propose various expressivity encoders that assist expressivity transfer in the text-to-speech (TTS) system; more precisely, we used self-attention statistical pooling and multi-scale expressivity encoder architectures to create a meaningful representation of expressivity. In addition to the traditional subjective metrics used for speech synthesis evaluation, we incorporated cosine similarity to measure the strength of the attributes associated with speaker and expressivity. A non-autoregressive TTS system with a multi-scale expressivity encoder showed better expressivity transfer with both Glow- and DPM-based decoders, illustrating the ability of the multi-scale architecture to capture the underlying attributes of expressivity from multiple acoustic features.
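The cosine-similarity measure mentioned in this abstract is straightforward to compute between two embedding vectors; the sketch below uses invented vector values purely for illustration.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors: 1 = same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical speaker embeddings: one from synthesized speech, one reference.
synth_speaker = np.array([0.2, 0.9, -0.3])
ref_speaker   = np.array([0.25, 0.8, -0.2])

similarity = cosine_similarity(synth_speaker, ref_speaker)
```
A value near 1 would indicate the synthesized voice retains the reference speaker's (or expressivity) attributes; the same computation applies to expressivity embeddings.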
Interspeech. Pub Date: 2022-09-18. DOI: 10.21437/interspeech.2022-10782
"A blueprint for using deepfakes in sociolinguistic matched-guise experiments"
N. Young, D. Britain, A. Leemann
Abstract: Matched-guise paradigms, which are used extensively in speaker and accent evaluation studies, have long been hampered by empirical holes. We offer a solution by incorporating deepfake technology, which greatly reduces the number of potential confounds. We constructed a sociophonetic experiment whereby high-rising terminal (a.k.a. "uptalk"), and the lack thereof, was superimposed onto a deepfaked "beautiful" and "less beautiful" female guise. The resulting four guises were incorporated into a 2x2-factor between-subjects experiment tested on female evaluators. Each evaluator assessed their respective guise against a list of prescribed attributes and offered free-form comments. Results align with studies on high-rising terminal as well as with intuitions concerning conventional beauty, which validates the technique and motivates its wider adoption.
Interspeech. Pub Date: 2022-09-18. DOI: 10.21437/interspeech.2022-10106
"End-to-End Audio-Visual Neural Speaker Diarization"
Maokui He, Jun Du, Chin-Hui Lee
Abstract: In this paper, we propose a novel end-to-end neural-network-based audio-visual speaker diarization method. Unlike most existing audio-visual methods, our model takes audio features (e.g., FBANKs), multi-speaker lip regions of interest (ROIs), and multi-speaker i-vector embeddings as multi-modal inputs, and a set of binary classification output layers produces the activity of each speaker. With this finely designed end-to-end structure, the proposed method can explicitly handle overlapping speech and accurately distinguish between speech and non-speech using multi-modal information. I-vectors are key to solving the alignment problem caused by visual-modality errors (e.g., occlusions, off-screen speakers, or unreliable detection). Moreover, our audio-visual model is robust to the absence of the visual modality, a condition under which a visual-only model degrades significantly. Evaluated on the datasets of the first Multi-modal Information based Speech Processing (MISP) challenge, the proposed method achieved diarization error rates (DERs) of 10.1%/9.5% on the development/evaluation sets with reference voice activity detection (VAD) information, while the audio-only and video-only systems yielded DERs of 27.9%/29.0% and 14.6%/13.1%, respectively.
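For readers unfamiliar with the DER figures quoted above, a simplified frame-level version can be sketched as follows. Real DER scoring (with a forgiveness collar and an optimal reference-to-hypothesis speaker mapping) is more involved; this sketch assumes the labels are already aligned and uses invented per-frame labels.

```python
import numpy as np

def frame_der(ref, hyp):
    """Frame-level diarization error rate (simplified).

    ref, hyp: per-frame speaker labels, 0 = silence.
    DER = (missed speech + false alarm + speaker confusion) / reference speech.
    """
    ref, hyp = np.asarray(ref), np.asarray(hyp)
    missed      = np.sum((ref != 0) & (hyp == 0))   # speech marked as silence
    false_alarm = np.sum((ref == 0) & (hyp != 0))   # silence marked as speech
    confusion   = np.sum((ref != 0) & (hyp != 0) & (ref != hyp))
    speech      = np.sum(ref != 0)                  # total reference speech
    return (missed + false_alarm + confusion) / speech

ref = [0, 1, 1, 1, 2, 2, 0, 0]
hyp = [0, 1, 1, 2, 2, 2, 2, 0]
# 1 confusion frame + 1 false-alarm frame over 5 reference speech frames.
der = frame_der(ref, hyp)  # 0.4
```
Note that, as in the abstract's comparison, the same reference labels can score audio-only, video-only, and audio-visual hypotheses on equal footing.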
Interspeech. Pub Date: 2022-09-18. DOI: 10.21437/interspeech.2022-759
"Speaking Rate Control of end-to-end TTS Models by Direct Manipulation of the Encoder's Output Embeddings"
Martin Lenglet, O. Perrotin, G. Bailly
Abstract: Since neural text-to-speech models now achieve such high standards of naturalness, the focus of the field has gradually shifted to gaining more control over the expressiveness of synthetic voices. One such lever is the speaking rate, which has become harder for a human operator to control since the introduction of neural attention networks to model speech dynamics. While numerous models have reintroduced explicit duration control (e.g., FastSpeech2), they generally rely on additional tasks during training. In this paper, we show that an acoustic analysis of the internal embeddings delivered by the encoder of an unsupervised end-to-end Tacotron2 TTS model is enough to identify and control acoustic parameters of interest. Specifically, we compare this speaking rate control with the duration control offered by a supervised FastSpeech2 model. Experimental results show that the control provided by the embeddings reproduces a behaviour closer to natural speech data.
Interspeech. Pub Date: 2022-09-18. DOI: 10.21437/interspeech.2022-387
"Automatic Detection of Reactive Attachment Disorder Through Turn-Taking Analysis in Clinical Child-Caregiver Sessions"
Andréi Birladeanu, H. Minnis, A. Vinciarelli
Abstract: To the best of our knowledge, this is the first work aimed at the automatic detection of Reactive Attachment Disorder, a psychiatric condition typically affecting children who have experienced abuse and neglect. The proposed approach is based on the analysis of turn-taking during clinical sessions, and the experiments involved 61 children and their caregivers. The results show that it is possible to detect the pathology with accuracy up to 69.2% (F1 score 68.8%). In addition, the experiments show that the pathology tends to leave different behavioral traces in different activities, which might explain why Reactive Attachment Disorder is difficult to diagnose and tends to remain undetected. In such a context, methodologies like the one proposed in this work can be a valuable support in clinical practice.
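Turn-taking analysis of the kind described in this abstract typically starts from simple statistics over speaker turns. The sketch below computes speaker switches and inter-turn gaps or overlaps from hypothetical (speaker, start, end) tuples; it illustrates the general idea, not the authors' actual feature set.

```python
def turn_taking_features(turns):
    """Simple turn-taking statistics from (speaker, start_sec, end_sec) tuples.

    Returns the number of speaker switches and, for each pair of consecutive
    turns, the gap (positive) or overlap (negative) duration in seconds.
    """
    turns = sorted(turns, key=lambda t: t[1])  # order by start time
    switches, gaps = 0, []
    for (spk_a, _, end_a), (spk_b, start_b, _) in zip(turns, turns[1:]):
        if spk_b != spk_a:
            switches += 1
        gaps.append(start_b - end_a)
    return switches, gaps

# Invented toy session: a 0.5 s gap, then a 0.5 s overlap (interruption).
turns = [("child", 0.0, 2.0), ("caregiver", 2.5, 5.0), ("child", 4.5, 6.0)]
switches, gaps = turn_taking_features(turns)  # 2, [0.5, -0.5]
```
Distributions of such gap, overlap, and switch statistics per activity are the kind of behavioral trace a classifier could then be trained on.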
Interspeech. Pub Date: 2022-09-18. DOI: 10.21437/interspeech.2022-355
"Ant Multilingual Recognition System for OLR 2021 Challenge"
Anqi Lyu, Zhiming Wang, Huijia Zhu
Abstract: This paper presents a comprehensive description of the Ant multilingual recognition system for the 6th Oriental Language Recognition (OLR 2021) Challenge. Inspired by transfer learning, the encoder of the language identification (LID) model is initialized from a pretrained automatic speech recognition (ASR) network to integrate lexical phonetic information into language identification. The ASR model is an encoder-decoder network based on the U2++ architecture [1]. The LID model inherits the shared Conformer encoder [2] from the pretrained ASR model, which is effective at capturing global information and modeling local invariance; an attentive statistical pooling layer and a subsequent linear projection layer are added on top of the encoder, and the model is then fine-tuned to its optimum. Furthermore, data augmentation, score normalization, and model ensembling are investigated and analysed in detail as strategies to improve performance. In the OLR 2021 Challenge, our submitted systems ranked first in both Task 1 and Task 2, with primary metrics of 0.0025 and 0.0039 respectively, less than one third of the second-place scores, which illustrates that our methodologies for multilingual identification are effective and competitive in real-life scenarios.
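The attentive statistical pooling layer mentioned in this abstract can be sketched roughly as follows. For simplicity the frame scorer here is a single random vector rather than the small learned network such layers normally use, and all sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def attentive_stat_pooling(frames, w):
    """Attentive statistical pooling over a (T, D) frame sequence.

    A scoring vector w rates each frame; the resulting attention weights
    form a weighted mean and standard deviation, concatenated into a
    fixed-size 2*D utterance-level vector regardless of T.
    """
    scores = frames @ w                            # (T,) per-frame relevance
    alpha = softmax(scores)                        # attention weights, sum to 1
    mean = (alpha[:, None] * frames).sum(axis=0)   # weighted mean, (D,)
    var = (alpha[:, None] * (frames - mean) ** 2).sum(axis=0)
    std = np.sqrt(np.maximum(var, 1e-9))           # weighted std, (D,)
    return np.concatenate([mean, std])             # (2*D,)

T, D = 50, 16                                      # illustrative sizes
frames = rng.standard_normal((T, D))               # stand-in encoder outputs
w = rng.standard_normal(D)                         # stand-in learned scorer
utt = attentive_stat_pooling(frames, w)            # fixed-size utterance vector
```
The fixed-size output is what lets a linear projection layer on top produce language posteriors from variable-length utterances.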