Interspeech最新文献_第9页

Interpretabilty of Speech Emotion Recognition modelled using Self-Supervised Speech and Text Pre-Trained Embeddings 使用自监督语音和文本预训练嵌入建模的语音情感识别的可解释性

Interspeech Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-10685

K. V. V. Girish, Srikanth Konjeti, Jithendra Vepa

{"title":"Interpretabilty of Speech Emotion Recognition modelled using Self-Supervised Speech and Text Pre-Trained Embeddings","authors":"K. V. V. Girish, Srikanth Konjeti, Jithendra Vepa","doi":"10.21437/interspeech.2022-10685","DOIUrl":"https://doi.org/10.21437/interspeech.2022-10685","url":null,"abstract":"Speech emotion recognition (SER) is useful in many applications and is approached using signal processing techniques in the past and deep learning techniques recently. Human emotions are complex in nature and can vary widely within an utterance. The SER accuracy has improved using various multimodal techniques but there is still some gap in understanding the model behaviour and expressing these complex emotions in a human interpretable form. In this work, we propose and define interpretability measures represented as a Human Level Indicator Matrix for an utterance and showcase it’s effective-ness in both qualitative and quantitative terms. A word level interpretability is presented using an attention based sequence modelling of self-supervised speech and text pre-trained embeddings. Prosody features are also combined with the proposed model to see the efficacy at the word and utterance level. We provide insights into sub-utterance level emotion predictions for complex utterances where the emotion classes change within the utterance. We evaluate the model and provide the interpretations on the publicly available IEMOCAP dataset.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"4496-4500"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49559750","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 1

The CLIPS System for 2022 Spoofing-Aware Speaker Verification Challenge 2022年欺骗感知说话人验证挑战赛的CLIPS系统

Interspeech Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-320

Jucai Lin, Tingwei Chen, Jingbiao Huang, Ruidong Fang, Jun Yin, Yuanping Yin, W. Shi, Wei Huang, Yapeng Mao

引用次数: 2

Single-channel speech enhancement using Graph Fourier Transform 使用图傅里叶变换的单通道语音增强

Interspeech Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-740

Chenhui Zhang, Xiang Pan

引用次数: 0

Streamable Speech Representation Disentanglement and Multi-Level Prosody Modeling for Live One-Shot Voice Conversion 实时单次语音转换的可流语音表示解纠缠和多级韵律建模

Interspeech Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-10277

Haoquan Yang, Liqun Deng, Y. Yeung, Nianzu Zheng, Yong Xu

引用次数: 4

A Unified System for Voice Cloning and Voice Conversion through Diffusion Probabilistic Modeling 基于扩散概率建模的统一语音克隆和语音转换系统

Interspeech Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-10879

T. Sadekova, Vladimir Gogoryan, Ivan Vovk, Vadim Popov, M. Kudinov, Jiansheng Wei

引用次数: 5

SoundDoA: Learn Sound Source Direction of Arrival and Semantics from Sound Raw Waveforms SoundDoA：从声音原始波形中学习声源到达方向和语义

Interspeech Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-378

Yuhang He, A. Markham

{"title":"SoundDoA: Learn Sound Source Direction of Arrival and Semantics from Sound Raw Waveforms","authors":"Yuhang He, A. Markham","doi":"10.21437/interspeech.2022-378","DOIUrl":"https://doi.org/10.21437/interspeech.2022-378","url":null,"abstract":"A fundamental task for an agent to understand an environment acoustically is to detect sound source location (like direction of arrival (DoA)) and semantic label. It is a challenging task: firstly, sound sources overlap in time, frequency and space; secondly, while semantics are largely conveyed through time-frequency energy (amplitude) contours, DoA is encoded in inter-channel phase difference; lastly, although the number of microphone sensors are sparse, recorded sound waveform is temporally dense due to the high sampling rates. Existing methods for predicting DoA mostly depend on pre-extracted 2D acoustic feature such as GCC-PHAT and Mel-spectrograms so as to benefit from the success of mature 2D image based deep neural networks. We instead propose a novel end-to-end trainable framework, named SoundDoA , that is capable of learning sound source DoA and semantics directly from sound raw waveforms. We first use a learnable front-end filter bank to dynamically encode sound source semantics and DoA relevant features into a compact representation. A backbone network consisting of two identical sub-networks with layerwise communication strategy is then proposed to further learn semantic label and DoA both separately and jointly. Finally, a permutation invariant multi-track head is added to regress DoA and classify semantic label. Extensive experimental results on DCASE 2020 sound event detection and localization dataset (SELD) demonstrate the superiority of SoundDoA , when comparing with other existing methods.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"2408-2412"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47392486","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 3

Overlapped Frequency-Distributed Network: Frequency-Aware Voice Spoofing Countermeasure 重叠频率分布网络:频率感知语音欺骗对策

Interspeech Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-657

Sunmook Choi, Il-Youp Kwak, Seungsang Oh

{"title":"Overlapped Frequency-Distributed Network: Frequency-Aware Voice Spoofing Countermeasure","authors":"Sunmook Choi, Il-Youp Kwak, Seungsang Oh","doi":"10.21437/interspeech.2022-657","DOIUrl":"https://doi.org/10.21437/interspeech.2022-657","url":null,"abstract":"Numerous IT companies around the world are developing and deploying artificial voice assistants via their products, but they are still vulnerable to spoofing attacks. Since 2015, the competition “Automatic Speaker Verification Spoofing and Countermeasures Challenge (ASVspoof)” has been held every two years to encourage people to design systems that can detect spoofing attacks. In this paper, we focused on developing spoofing countermeasure systems mainly based on Convolutional Neural Networks (CNNs). However, CNNs have translation invariant property, which may cause loss of frequency information when a spectrogram is used as input. Hence, we pro-pose models which split inputs along the frequency axis: 1) Overlapped Frequency-Distributed (OFD) model and 2) Non-overlapped Frequency-Distributed (Non-OFD) model. Using ASVspoof 2019 dataset, we measured their performances with two different activations; ReLU and Max feature map (MFM). The best performing model on LA dataset is the Non-OFD model with ReLU which achieved an equal error rate (EER) of 1.35%, and the best performing model on PA dataset is the OFD model with MFM which achieved an EER of 0.35%.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"3558-3562"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47675680","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 3

How do our eyebrows respond to masks and whispering? The case of Persians 我们的眉毛对面具和窃窃私语有何反应？波斯人的情况

Interspeech Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-10867

Nasim Mahdinazhad Sardhaei, Marzena Żygis, H. Sharifzadeh

{"title":"How do our eyebrows respond to masks and whispering? The case of Persians","authors":"Nasim Mahdinazhad Sardhaei, Marzena Żygis, H. Sharifzadeh","doi":"10.21437/interspeech.2022-10867","DOIUrl":"https://doi.org/10.21437/interspeech.2022-10867","url":null,"abstract":"Whispering is one of the mechanisms of human communication to convey linguistic information. Due to the lack of vocal fold vibration, whispering acoustically differs from the voiced speech in the absence of fundamental frequency which is one of the main prosodic correlates of intonation. This study addresses the importance of facial cues with respect to acoustic cues of intonation. Specifically, we aim to probe how eyebrow velocity and furrowing change when people whisper and wear face masks, also, when they are supposed to produce a prosodic modulation as it is the case in polar questions with rising intonation. To this end, we run an experiment with 10 Persian speakers. The results show the greater mean speed when speakers whisper indicating a compensation effect for the lack of F0 in whispering. We also found a more pronounced movement of both eyebrows when the speakers wear a mask. Finally, our results reveal greater eyebrow motions in questions suggesting the question is a more marked utterance type than a statement. No significant effect of eyebrow furrowing was found. However, eyebrow movements were positively correlated with the eyebrow widening suggesting a mutual link between these two movement types.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"2023-2027"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47894190","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

End-to-End Joint Modeling of Conversation History-Dependent and Independent ASR Systems with Multi-History Training 会话历史依赖和独立ASR系统的多历史训练端到端联合建模

Interspeech Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-11357

Ryo Masumura, Yoshihiro Yamazaki, Saki Mizuno, Naoki Makishima, Mana Ihori, Mihiro Uchida, Hiroshi Sato, Tomohiro Tanaka, Akihiko Takashima, Satoshi Suzuki, Shota Orihashi, Takafumi Moriya, Nobukatsu Hojo, Atsushi Ando

{"title":"End-to-End Joint Modeling of Conversation History-Dependent and Independent ASR Systems with Multi-History Training","authors":"Ryo Masumura, Yoshihiro Yamazaki, Saki Mizuno, Naoki Makishima, Mana Ihori, Mihiro Uchida, Hiroshi Sato, Tomohiro Tanaka, Akihiko Takashima, Satoshi Suzuki, Shota Orihashi, Takafumi Moriya, Nobukatsu Hojo, Atsushi Ando","doi":"10.21437/interspeech.2022-11357","DOIUrl":"https://doi.org/10.21437/interspeech.2022-11357","url":null,"abstract":"This paper proposes end-to-end joint modeling of conversation history-dependent and independent automatic speech recognition (ASR) systems. Conversation histories are available in ASR systems such as meeting transcription applications but not available in those such as voice search applications. So far, these two ASR systems have been individually constructed using different models, but this is inefficient for each application. In fact, conventional conversation history-dependent ASR systems can perform both history-dependent and independent processing. However, their performance is inferior to history-independent ASR systems. This is because the model architecture and its training criterion in the conventional conversation history-dependent ASR systems are specialized in the case where conversational histories are available. To address this problem, our proposed end-to-end joint modeling method uses a crossmodal transformer-based architecture that can flexibly switch to use the conversation histories or not. In addition, we propose multi-history training that simultaneously utilizes a dataset without histories and datasets with various histories to effectively improve both types of ASR processing by introduc-ing unified architecture. Experiments on Japanese ASR tasks demonstrate the effectiveness of the proposed method. multi-history training which can produce a robust ASR model against both a variety of conversational contexts and none. Experimental results showed that the proposed E2E joint model provides superior performance in both history-dependent and independent ASR processing compared with conventional E2E-ASR systems.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"3218-3222"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47910133","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

BiCAPT: Bidirectional Computer-Assisted Pronunciation Training with Normalizing Flows 双向计算机辅助发音训练与规范化流

Interspeech Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-878

Zhan Zhang, Yuehai Wang, Jianyi Yang

引用次数: 0