Latest Interspeech Publications

Acoustic Representation Learning on Breathing and Speech Signals for COVID-19 Detection
Interspeech Pub Date: 2022-09-18 | DOI: 10.21437/interspeech.2022-10376
Debottam Dutta, Debarpan Bhattacharya, Sriram Ganapathy, A. H. Poorjam, Deepak Mittal, M. Singh
Abstract: In this paper, we describe an approach for representation learning of audio signals for the task of COVID-19 detection. The raw audio samples are processed with a bank of 1-D convolutional filters that are parameterized as cosine-modulated Gaussian functions. The choice of these kernels allows the filterbank to be interpreted as a set of smooth band-pass filters. The filtered outputs are pooled, log-compressed, and used in a self-attention-based relevance weighting mechanism that emphasizes the regions of the time-frequency decomposition most important for the downstream task. The subsequent layers of the model consist of a recurrent architecture, and the models are trained for a COVID-19 detection task. In our experiments on the Coswara data set, we show that the proposed model achieves significant performance improvements over the baseline system as well as other representation learning approaches. Further, the proposed approach is shown to be uniformly applicable to speech and breathing signals and to transfer learning from a larger data set.
Pages: 2863-2867
Citations: 1
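The front end described in this abstract has a simple closed form. Below is a minimal NumPy sketch of the idea: 1-D kernels parameterized as cosine-modulated Gaussians, applied to a waveform and followed by energy pooling and log compression. The filter count, center frequencies, bandwidth-to-envelope mapping, and frame/hop sizes are illustrative assumptions, not the authors' configuration.

```python
import numpy as np

def cosine_modulated_gaussian(center_freq_hz, bandwidth_hz, sr=16000, kernel_len=401):
    """Build a 1-D kernel: a Gaussian envelope modulated by a cosine carrier."""
    t = (np.arange(kernel_len) - kernel_len // 2) / sr      # time axis centered at 0
    sigma = 1.0 / (2.0 * np.pi * bandwidth_hz)              # assumed mapping: wider band -> narrower envelope
    envelope = np.exp(-0.5 * (t / sigma) ** 2)
    carrier = np.cos(2.0 * np.pi * center_freq_hz * t)
    kernel = envelope * carrier
    return kernel / np.linalg.norm(kernel)                  # unit norm for comparable outputs

def filterbank_features(x, centers_hz, bandwidth_hz=100.0, sr=16000, frame=400, hop=160):
    """Filter the waveform with each kernel, pool frame energies, log-compress."""
    feats = []
    for fc in centers_hz:
        y = np.convolve(x, cosine_modulated_gaussian(fc, bandwidth_hz, sr), mode="same")
        energies = [np.mean(y[s:s + frame] ** 2)            # frame-level energy pooling
                    for s in range(0, len(y) - frame + 1, hop)]
        feats.append(np.log(np.asarray(energies) + 1e-8))   # log compression
    return np.stack(feats, axis=1)                          # (num_frames, num_filters)

# Illustrative usage on a random "waveform"; real inputs would be breathing/speech recordings.
x = np.random.randn(16000)
centers = np.linspace(100, 4000, 40)                        # assumed: 40 filters between 100 Hz and 4 kHz
print(filterbank_features(x, centers).shape)
```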
Japanese ASR-Robust Pre-trained Language Model with Pseudo-Error Sentences Generated by Grapheme-Phoneme Conversion
Interspeech Pub Date: 2022-09-18 | DOI: 10.21437/interspeech.2022-327
Yasuhito Ohsugi, Itsumi Saito, Kyosuke Nishida, Sen Yoshida
Abstract: Spoken language understanding systems typically consist of a pipeline of automatic speech recognition (ASR) and natural language processing (NLP) modules. Although pre-trained language models (PLMs) have been successful in NLP by training on large corpora of written text, spoken language containing serious ASR errors that change its meaning is difficult for them to understand. We propose a method for pre-training Japanese LMs that are robust against ASR errors without using an ASR system. The proposed method takes written texts and generates sentences containing pseudo-ASR errors using a pseudo-error dictionary constructed with neural grapheme-to-phoneme and phoneme-to-grapheme models. Experiments on spoken dialogue summarization showed that the ASR-robust LM pre-trained with the proposed method outperformed the LM pre-trained with standard masked language modeling by 3.17 points on ROUGE-L when fine-tuned on dialogues including ASR errors.
Pages: 2688-2692
Citations: 0
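To illustrate the pseudo-error idea, here is a toy sketch in which words are replaced with phonetically confusable alternatives drawn from a pseudo-error dictionary. The dictionary entries below are invented English examples; the paper builds its dictionary automatically for Japanese using neural grapheme-to-phoneme and phoneme-to-grapheme models.

```python
import random

# Hypothetical pseudo-error dictionary: word -> phonetically confusable substitutes.
# The paper constructs such a dictionary automatically with G2P / P2G models; these entries are made up.
PSEUDO_ERROR_DICT = {
    "meeting": ["meting", "heating"],
    "speech": ["speach", "peach"],
    "recognize": ["wreck a nice", "recognise"],
    "schedule": ["shed yule"],
}

def inject_pseudo_asr_errors(sentence, error_rate=0.3, seed=None):
    """Replace some words with confusable variants to imitate ASR errors in written text."""
    rng = random.Random(seed)
    out = []
    for word in sentence.split():
        candidates = PSEUDO_ERROR_DICT.get(word.lower())
        if candidates and rng.random() < error_rate:
            out.append(rng.choice(candidates))   # substitute a confusable form
        else:
            out.append(word)                     # keep the original word
    return " ".join(out)

print(inject_pseudo_asr_errors("please schedule the meeting and recognize speech", seed=0))
```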
Bottom-up discovery of structure and variation in response tokens ('backchannels') across diverse languages
Interspeech Pub Date: 2022-09-18 | DOI: 10.21437/interspeech.2022-11288
Andreas Liesenfeld, Mark Dingemanse
Abstract: Response tokens (also known as backchannels, continuers, or feedback) are a frequent feature of human interaction, where they serve to display understanding and streamline turn-taking. We propose a bottom-up method to study responsive behaviour across 16 languages (8 language families). We use sequential context and recurrence of turn formats to identify candidate response tokens in a language-agnostic way across diverse conversational corpora. We then use UMAP clustering directly on speech signals to represent structure and variation. We find that (i) written orthographic annotations underrepresent the attested variation, (ii) distinctions between formats can be gradient rather than discrete, and (iii) most languages appear to make available a broad distinction between a minimal nasal format ('mm') and a fuller 'yeah'-like format. Charting this aspect of human interaction contributes to our understanding of interactional infrastructure across languages and can inform the design of speech technologies.
Pages: 1126-1130
Citations: 3
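As a rough sketch of the clustering step, the snippet below summarizes short response-token clips with mean MFCCs and projects them to 2-D with UMAP (using the librosa and umap-learn packages). The MFCC summarization is an assumption made for illustration; the paper clusters representations of the speech signal directly.

```python
import numpy as np
import librosa            # pip install librosa
import umap               # pip install umap-learn

def token_embedding(wav_path, sr=16000, n_mfcc=13):
    """Summarize one short response token as the mean of its MFCC frames."""
    y, _ = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # (n_mfcc, frames)
    return mfcc.mean(axis=1)

def embed_tokens(wav_paths):
    """Project all candidate tokens to 2-D with UMAP to inspect structure and variation."""
    X = np.stack([token_embedding(p) for p in wav_paths])
    reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=0)
    return reducer.fit_transform(X)                           # (n_tokens, 2)

# Usage (paths are placeholders for clipped 'mm' / 'yeah'-like tokens):
# coords = embed_tokens(["token_001.wav", "token_002.wav", ...])
```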
FlowCPCVC: A Contrastive Predictive Coding Supervised Flow Framework for Any-to-Any Voice Conversion
Interspeech Pub Date: 2022-09-18 | DOI: 10.21437/interspeech.2022-577
Jiahong Huang, Wen Xu, Yule Li, Junshi Liu, Dongpeng Ma, Wei Xiang
Abstract: Recently, research on any-to-any voice conversion (VC) has developed rapidly. However, existing systems often suffer from unsatisfactory quality and require two training stages, in which a spectrum generation process is indispensable. In this paper, we propose the FlowCPCVC system, which achieves higher speech naturalness and timbre similarity. To our knowledge, FlowCPCVC is the first one-stage training system for the any-to-any task, taking advantage of a VAE and contrastive learning. We employ a speaker encoder to extract timbre information, and a contrastive predictive coding (CPC) based content extractor to guide the flow module to discard the timbre while keeping the linguistic information. Our method directly incorporates the vocoder into training, thus avoiding the loss of spectral information that occurs in two-stage training. The same training strategy also yields robust results when the model is used for any-to-many conversion. Experiments show that FlowCPCVC achieves a clear improvement over VQMIVC, the current state-of-the-art any-to-any voice conversion system. Our demo is available online.
Pages: 2558-2562
Citations: 3
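FlowCPCVC's content extractor builds on contrastive predictive coding. The sketch below shows only the generic InfoNCE objective that underlies CPC, in NumPy, with random vectors standing in for the context, positive, and negative representations; the flow module, speaker encoder, and vocoder coupling are not reproduced here.

```python
import numpy as np

def info_nce_loss(context, future, negatives):
    """InfoNCE loss for one prediction step.

    context:   (d,)    predicted representation of the future frame
    future:    (d,)    the true future frame representation (positive)
    negatives: (n, d)  representations drawn from other frames/utterances
    """
    candidates = np.vstack([future[None, :], negatives])   # positive goes in row 0
    scores = candidates @ context                           # dot-product similarities
    scores -= scores.max()                                  # numerical stability
    log_softmax = scores - np.log(np.exp(scores).sum())
    return -log_softmax[0]                                  # -log P(positive | candidates)

# Usage with random stand-in representations:
rng = np.random.default_rng(0)
d = 64
loss = info_nce_loss(rng.normal(size=d), rng.normal(size=d), rng.normal(size=(10, d)))
print(float(loss))
```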
Reducing Domain Mismatch in Self-supervised Speech Pre-training
Interspeech Pub Date: 2022-09-18 | DOI: 10.21437/interspeech.2022-736
M. Baskar, A. Rosenberg, B. Ramabhadran, Yu Zhang, Nicolás Serrano
Abstract: Masked speech modeling (MSM) methods such as wav2vec2 or w2v-BERT learn representations over speech frames which are randomly masked within an utterance. While these methods improve the performance of automatic speech recognition (ASR) systems, they have one major limitation: they treat all unsupervised speech samples with equal weight, which hinders learning because not all samples contain information relevant to learning meaningful representations. In this work, we address this limitation. We propose ask2mask (ATM), a novel approach to focus on specific samples during MSM pre-training. ATM employs an external ASR model or scorer to weight unsupervised input samples by performing fine-grained data selection, and it performs masking over the input frames in which the scorer is highly confident. This allows the model to learn meaningful representations. We conduct fine-tuning experiments on two well-benchmarked corpora: LibriSpeech (matching the pre-training data), and AMI and CHiME-6 (not matching the pre-training data). The results substantiate the efficacy of ATM in significantly improving recognition performance under mismatched conditions while still yielding modest improvements under matched conditions.
Pages: 3028-3032
Citations: 0
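A toy sketch of the data-selection idea: an external scorer assigns a confidence to each frame, and mask positions are sampled in proportion to that confidence rather than uniformly. The sampling scheme and mask ratio here are illustrative assumptions rather than the paper's exact procedure.

```python
import numpy as np

def ask2mask_indices(frame_confidences, mask_ratio=0.15, seed=0):
    """Pick frames to mask, biased toward frames the external scorer is confident about."""
    rng = np.random.default_rng(seed)
    conf = np.asarray(frame_confidences, dtype=float)
    probs = conf / conf.sum()                                # higher confidence -> more likely masked
    n_mask = max(1, int(mask_ratio * len(conf)))
    return rng.choice(len(conf), size=n_mask, replace=False, p=probs)

# Usage: frames with confidence near 1.0 are chosen far more often than noisy, low-confidence frames.
confidences = np.concatenate([np.full(50, 0.9), np.full(50, 0.1)])
print(np.sort(ask2mask_indices(confidences, mask_ratio=0.2)))
```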
Perceptual Evaluation of Penetrating Voices through a Semantic Differential Method
Interspeech Pub Date: 2022-09-18 | DOI: 10.21437/interspeech.2022-100
T. Kitamura, Naoki Kunimoto, Hideki Kawahara, S. Amano
Abstract: Some speakers have penetrating voices that pop out and can be heard clearly even in loud noise or from a long distance. This study investigated the voice quality of penetrating voices using factor analysis. Eleven participants scored how well the voices of 124 speakers popped out from babble noise. Taking this score as an index of penetration, ten high-scored and ten low-scored speakers were selected for a rating experiment using a semantic differential method. Forty undergraduate students rated a Japanese sentence produced by these speakers on 14 bipolar 7-point scales concerning voice quality. A factor analysis was conducted on the data from 13 of the scales (excluding the 'penetrating' scale itself). Three main factors were obtained: (1) powerful and metallic, (2) feminine, and (3) esthetic. The first factor (powerful and metallic) correlated highly with the penetration ratings. These results suggest that penetrating voices have multi-dimensional voice quality and that their characteristics relate to the powerful and metallic aspects of voice.
Pages: 3063-3067
Citations: 0
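The analysis pipeline can be sketched with scikit-learn's FactorAnalysis. Random integers stand in for the 7-point ratings, and the varimax rotation is an assumption made for readable loadings, not necessarily the rotation used in the study.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# Stand-in data: 40 raters x 20 speakers = 800 ratings on 13 bipolar 7-point scales.
# Random values are used here purely to show the analysis pipeline, not to reproduce the study.
rng = np.random.default_rng(0)
ratings = rng.integers(1, 8, size=(800, 13)).astype(float)

fa = FactorAnalysis(n_components=3, rotation="varimax", random_state=0)  # three factors, as in the study
fa.fit(ratings)

loadings = fa.components_   # (3, 13): how each rating scale loads on each factor
print(loadings.shape)
```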
CALM: Contrastive Cross-modal Speaking Style Modeling for Expressive Text-to-Speech Synthesis
Interspeech Pub Date: 2022-09-18 | DOI: 10.21437/interspeech.2022-11275
Yi Meng, Xiang Li, Zhiyong Wu, Tingtian Li, Zixun Sun, Xinyu Xiao, Chi Sun, Hui Zhan, H. Meng
Pages: 5533-5537
Citations: 4
Spatial-aware Speaker Diarization for Multi-channel Multi-party Meeting
Interspeech Pub Date: 2022-09-18 | DOI: 10.21437/Interspeech.2022-11412
Jie Wang, Yuji Liu, Binling Wang, Yiming Zhi, Song Li, Shipeng Xia, Jiayang Zhang, Feng Tong, Lin Li, Q. Hong
Abstract: This paper describes a spatial-aware speaker diarization system for multi-channel multi-party meetings. The diarization system obtains speaker direction information from a microphone array. A speaker spatial embedding is generated from the x-vector and an s-vector derived from superdirective beamforming (SDB), which makes the embedding more robust. Specifically, we propose a novel multi-channel sequence-to-sequence neural network architecture named the discriminative multi-stream neural network (DMSNet), which consists of an attention superdirective beamforming (ASDB) block and a Conformer encoder. The proposed ASDB is a self-adapted channel-wise block that extracts latent spatial features of array audio by modeling interdependencies between channels. We explore DMSNet to address the overlapped speech problem on multi-channel audio and achieve 93.53% accuracy on the evaluation set. With the DMSNet-based overlapped speech detection (OSD) module, the diarization error rate (DER) of the cluster-based diarization system decreases significantly from 13.45% to 7.64%.
Pages: 1491-1495
Citations: 6
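The SDB front end is a fixed beamformer with a standard closed-form solution. Below is a minimal NumPy sketch of superdirective (MVDR-under-diffuse-noise) weights for a single frequency bin; the array geometry, diagonal loading, and noise model are illustrative assumptions, and this is not the paper's ASDB block, which is learned.

```python
import numpy as np

def superdirective_weights(mic_positions, look_dir, freq, c=343.0, diag_load=1e-2):
    """Superdirective (MVDR under diffuse noise) weights for one frequency bin.

    mic_positions: (M, 3) microphone coordinates in meters
    look_dir:      (3,)   unit vector pointing at the target speaker
    """
    mic_positions = np.asarray(mic_positions, dtype=float)
    look_dir = np.asarray(look_dir, dtype=float)
    look_dir = look_dir / np.linalg.norm(look_dir)

    # Steering vector: phase delays of a far-field plane wave arriving from look_dir.
    delays = mic_positions @ look_dir / c                   # (M,)
    d = np.exp(-2j * np.pi * freq * delays)

    # Diffuse-noise coherence matrix: sinc of inter-mic distances (np.sinc(x) = sin(pi x)/(pi x)).
    dist = np.linalg.norm(mic_positions[:, None, :] - mic_positions[None, :, :], axis=-1)
    gamma = np.sinc(2.0 * freq * dist / c)
    gamma = gamma + diag_load * np.eye(len(mic_positions))  # diagonal loading for robustness

    gamma_inv_d = np.linalg.solve(gamma, d)
    return gamma_inv_d / (d.conj() @ gamma_inv_d)           # distortionless in the look direction

# Usage: a 4-mic linear array, look direction along +x, 1 kHz bin.
mics = np.array([[0.00, 0, 0], [0.05, 0, 0], [0.10, 0, 0], [0.15, 0, 0]])
w = superdirective_weights(mics, look_dir=[1.0, 0.0, 0.0], freq=1000.0)
print(np.abs(w))
```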
Online Learning of Open-set Speaker Identification by Active User-registration
Interspeech Pub Date: 2022-09-18 | DOI: 10.21437/interspeech.2022-25
Eunkyung Yoo, H. Song, Taehyeong Kim, Chul Lee
Abstract: Registering each user's identity for voice assistants is burdensome and complex in multi-user environments such as a household, particularly when registration needs to happen on-the-fly with relatively minimal effort. Most prior work on speaker identification (SID) does not seamlessly allow the addition of new speakers, as it does not support online updates. To deal with this limitation, we introduce a novel online learning approach to open-set SID that can actively register unknown users in the household setting. Based on MPART (Message Passing Adaptive Resonance Theory), our method performs online active semi-supervised learning for open-set SID, using speaker embedding vectors to infer new speakers and request the user's identity. Our method progressively improves overall SID performance without forgetting, making it attractive for many interactive real-world applications. We evaluate our model in an online-learning setting for the open-set SID task where new speakers are added on-the-fly, demonstrating its superior performance.
Pages: 5065-5069
Citations: 1
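The sketch below is not the MPART algorithm itself; it only illustrates the surrounding open-set logic the abstract describes: compare a new speaker embedding against registered centroids, ask for an identity when nothing matches, and update centroids online. The cosine-similarity threshold and running-mean update are illustrative choices.

```python
import numpy as np

class OpenSetSpeakerRegistry:
    """Toy open-set SID: cosine similarity to per-speaker centroids, with on-the-fly registration."""

    def __init__(self, threshold=0.7):
        self.threshold = threshold
        self.centroids = {}   # name -> (mean embedding, count)

    def identify(self, embedding):
        """Return (name, similarity) of the best match, or (None, best) if below threshold."""
        e = embedding / np.linalg.norm(embedding)
        best_name, best_sim = None, -1.0
        for name, (centroid, _) in self.centroids.items():
            sim = float(e @ (centroid / np.linalg.norm(centroid)))
            if sim > best_sim:
                best_name, best_sim = name, sim
        if best_sim < self.threshold:
            return None, best_sim          # unknown speaker -> ask the user to register
        return best_name, best_sim

    def register_or_update(self, name, embedding):
        """Running-mean update so known speakers improve without retraining."""
        if name in self.centroids:
            centroid, n = self.centroids[name]
            self.centroids[name] = ((centroid * n + embedding) / (n + 1), n + 1)
        else:
            self.centroids[name] = (embedding.astype(float), 1)

# Usage with random stand-in embeddings:
rng = np.random.default_rng(0)
registry = OpenSetSpeakerRegistry()
alice = rng.normal(size=128)
registry.register_or_update("alice", alice)
print(registry.identify(alice + 0.05 * rng.normal(size=128)))   # recognized
print(registry.identify(rng.normal(size=128)))                  # likely (None, low similarity)
```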
MSDWild: Multi-modal Speaker Diarization Dataset in the Wild
Interspeech Pub Date: 2022-09-18 | DOI: 10.21437/interspeech.2022-10466
Tao Liu, Shuai Fan, Xu Xiang, Hongbo Song, Shaoxiong Lin, Jiaqi Sun, Tianyuan Han, Siyuan Chen, Binwei Yao, Sen Liu, Yifei Wu, Y. Qian, Kai Yu
Abstract: Speaker diarization in real-world acoustic environments is a challenging task of increasing interest to both academia and industry. Although it is widely accepted that incorporating visual information benefits audio processing tasks such as speech recognition, there is currently no fully released dataset for benchmarking multi-modal speaker diarization performance in real-world environments. In this paper, we release MSDWild, a benchmark dataset for multi-modal speaker diarization in the wild. The dataset is collected from public videos, covering rich real-world scenarios and languages. All clips are naturally shot videos without over-editing such as lens switching. Both audio and video are released. In particular, MSDWild has a large portion of naturally overlapped speech, forming an excellent testbed for cocktail-party problem research. We also conduct baseline experiments on the dataset using audio-only, visual-only, and audio-visual speaker diarization.
Pages: 1476-1480
Citations: 6