Latest Interspeech papers

Perceptual Evaluation of Penetrating Voices through a Semantic Differential Method
Interspeech Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-100
T. Kitamura, Naoki Kunimoto, Hideki Kawahara, S. Amano
{"title":"Perceptual Evaluation of Penetrating Voices through a Semantic Differential Method","authors":"T. Kitamura, Naoki Kunimoto, Hideki Kawahara, S. Amano","doi":"10.21437/interspeech.2022-100","DOIUrl":"https://doi.org/10.21437/interspeech.2022-100","url":null,"abstract":"Some speakers have penetrating voices that can be popped out and heard clearly, even in loud noise or from a long distance. This study investigated the voice quality of the penetrating voices using factor analysis. Eleven participants scored how the voices of 124 speakers popped out from the babble noise. By assuming the score as an index of penetration, ten each of high- and low-scored speakers were selected for a rating experiment with a semantic differential method. Forty undergraduate students rated a Japanese sentence produced by these speakers using 14 bipolar 7-point scales concerning voice quality. A factor analysis was conducted using the data of 13 scales (i.e., excluding one scale of penetrating from 14 scales). Three main factors were obtained: (1) powerful and metallic, (2) feminine, and (3) esthetic. The first factor (powerful and metallic) highly correlated with the ratings of penetrating. These results suggest that penetrating voices have multi-dimensional voice quality and that the characteristics of penetrating voice related to powerful and metallic aspects of voices.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"3063-3067"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45037272","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
CALM: Constrastive Cross-modal Speaking Style Modeling for Expressive Text-to-Speech Synthesis
Interspeech Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-11275
Yi Meng, Xiang Li, Zhiyong Wu, Tingtian Li, Zixun Sun, Xinyu Xiao, Chi Sun, Hui Zhan, H. Meng
{"title":"CALM: Constrastive Cross-modal Speaking Style Modeling for Expressive Text-to-Speech Synthesis","authors":"Yi Meng, Xiang Li, Zhiyong Wu, Tingtian Li, Zixun Sun, Xinyu Xiao, Chi Sun, Hui Zhan, H. Meng","doi":"10.21437/interspeech.2022-11275","DOIUrl":"https://doi.org/10.21437/interspeech.2022-11275","url":null,"abstract":"","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"5533-5537"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45690327","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
Spatial-aware Speaker Diarization for Multi-channel Multi-party Meeting
Interspeech Pub Date : 2022-09-18 DOI: 10.21437/Interspeech.2022-11412
Jie Wang, Yuji Liu, Binling Wang, Yiming Zhi, Song Li, Shipeng Xia, Jiayang Zhang, Feng Tong, Lin Li, Q. Hong
{"title":"Spatial-aware Speaker Diarization for Multi-channel Multi-party Meeting","authors":"Jie Wang, Yuji Liu, Binling Wang, Yiming Zhi, Song Li, Shipeng Xia, Jiayang Zhang, Feng Tong, Lin Li, Q. Hong","doi":"10.21437/Interspeech.2022-11412","DOIUrl":"https://doi.org/10.21437/Interspeech.2022-11412","url":null,"abstract":"This paper describes a spatial-aware speaker diarization system for the multi-channel multi-party meeting. The diarization system obtains direction information of speaker by microphone array. Speaker spatial embedding is generated by x-vector and s-vector derived from superdirective beamforming (SDB) which makes the embedding more robust. Specifically, we propose a novel multi-channel sequence-to-sequence neural network architecture named discriminative multi-stream neural network (DMSNet) which consists of attention superdirective beamforming (ASDB) block and Conformer encoder. The proposed ASDB is a self-adapted channel-wise block that extracts the latent spatial features of array audios by modeling interdependencies between channels. We explore DMSNet to address overlapped speech problem on multi-channel audio and achieve 93.53% accuracy on evaluation set. By performing DMSNet based overlapped speech detection (OSD) module, the diarization error rate (DER) of cluster-based diarization system decrease significantly from 13.45% to 7.64%.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"1491-1495"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45765166","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 6
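The abstract above reports the diarization error rate (DER) falling from 13.45% to 7.64% once the overlapped speech detection module is added. As a reader's aid, the short sketch below converts those two quoted figures into absolute and relative reductions. The function name `relative_reduction` is hypothetical and the calculation is not taken from the paper; only the two DER values come from the abstract.

```python
def relative_reduction(before: float, after: float) -> float:
    """Return the reduction from `before` to `after` as a percentage of `before`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before * 100.0


# DER values quoted in the abstract, in percent.
der_cluster_only = 13.45
der_with_osd = 7.64

absolute_points = der_cluster_only - der_with_osd
relative_percent = relative_reduction(der_cluster_only, der_with_osd)

print(f"absolute reduction: {absolute_points:.2f} percentage points")
print(f"relative reduction: {relative_percent:.1f}%")
```

The distinction matters when comparing entries in this list: some abstracts quote absolute error rates, while others (such as the RNN-Transducer entry) quote relative improvements.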
Online Learning of Open-set Speaker Identification by Active User-registration
Interspeech Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-25
Eunkyung Yoo, H. Song, Taehyeong Kim, Chul Lee
{"title":"Online Learning of Open-set Speaker Identification by Active User-registration","authors":"Eunkyung Yoo, H. Song, Taehyeong Kim, Chul Lee","doi":"10.21437/interspeech.2022-25","DOIUrl":"https://doi.org/10.21437/interspeech.2022-25","url":null,"abstract":"Registering each user’s identity for voice assistants is burdensome and complex for multi-user environments like a household scenario. This is particularly true when the registration needs to happen on-the-fly with a relatively minimum effort. Most of the prior works for speaker identification (SID) do not seamlessly allow the addition of new speakers as these do not support online updates. To deal with such limitation, we introduce a novel online learning approach to open-set SID that can actively register unknown users in the household setting. Based on MPART (Message Passing Adaptive Resonance Theory), our method performs online active semi-supervised learning for open-set SID by using speaking embedding vectors to infer new speakers and request user’s identity. Our method progressively improves the overall SID performance without forgetting, making it attractive for many interactive real-world applications. We evaluate our model for the online learning setting of an open-set SID task where new speakers are added on-the-fly, demonstrating its superior performance.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"5065-5069"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45807880","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
MSDWild: Multi-modal Speaker Diarization Dataset in the Wild
Interspeech Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-10466
Tao Liu, Shuai Fan, Xu Xiang, Hongbo Song, Shaoxiong Lin, Jiaqi Sun, Tianyuan Han, Siyuan Chen, Binwei Yao, Sen Liu, Yifei Wu, Y. Qian, Kai Yu
{"title":"MSDWild: Multi-modal Speaker Diarization Dataset in the Wild","authors":"Tao Liu, Shuai Fan, Xu Xiang, Hongbo Song, Shaoxiong Lin, Jiaqi Sun, Tianyuan Han, Siyuan Chen, Binwei Yao, Sen Liu, Yifei Wu, Y. Qian, Kai Yu","doi":"10.21437/interspeech.2022-10466","DOIUrl":"https://doi.org/10.21437/interspeech.2022-10466","url":null,"abstract":"Speaker diarization in real-world acoustic environments is a challenging task of increasing interest from both academia and industry. Although it has been widely accepted that incorporating visual information benefits audio processing tasks such as speech recognition, there is currently no fully released dataset that can be used for benchmarking multi-modal speaker diarization performance in real-world environments. In this paper, we release MSDWild, a benchmark dataset for multimodal speaker diarization in the wild. The dataset is collected from public videos, covering rich real-world scenarios and languages. All video clips are naturally shot videos without over-editing such as lens switching. Audio and video are both released. In particular, MSDWild has a large portion of the naturally overlapped speech, forming an excellent testbed for cocktail-party problem research. Furthermore, we also conduct baseline experiments on the dataset using audio-only, visual-only, and audio-visual speaker diarization.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"1476-1480"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42383323","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 6
Zero-Shot Foreign Accent Conversion without a Native Reference
Interspeech Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-10664
Waris Quamer, Anurag Das, John M. Levis, E. Chukharev-Hudilainen, R. Gutierrez-Osuna
{"title":"Zero-Shot Foreign Accent Conversion without a Native Reference","authors":"Waris Quamer, Anurag Das, John M. Levis, E. Chukharev-Hudilainen, R. Gutierrez-Osuna","doi":"10.21437/interspeech.2022-10664","DOIUrl":"https://doi.org/10.21437/interspeech.2022-10664","url":null,"abstract":"Previous approaches for foreign accent conversion (FAC) either need a reference utterance from a native speaker (L1) during synthesis, or are dedicated one-to-one systems that must be trained separately for each non-native (L2) speaker. To address both issues, we propose a new FAC system that can transform L2 speech directly from previously unseen speakers. The system consists of two independent modules: a translator and a synthesizer, which operate on bottleneck features derived from phonetic posteriorgrams. The translator is trained to map bottleneck features in L2 utterances into those from a parallel L1 utterance. The synthesizer is a many-to-many system that maps input bottleneck features into the corresponding Mel-spectrograms, conditioned on an embedding from the L2 speaker. During inference, both modules operate in sequence to take an unseen L2 utterance and generate a native-accented Mel-spectrogram. Perceptual experiments show that our system achieves a large reduction (67%) in non-native accentedness compared to a state-of-the-art reference-free system (28.9%) that builds a dedicated model for each L2 speaker. Moreover, 80% of the listeners rated the synthesized utterances to have the same voice identity as the L2 speaker.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"4920-4924"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43009987","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
Effects of laryngeal manipulations on voice gender perception
Interspeech Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-10815
Zhaoyan Zhang, Jason Zhang, J. Kreiman
{"title":"Effects of laryngeal manipulations on voice gender perception","authors":"Zhaoyan Zhang, Jason Zhang, J. Kreiman","doi":"10.21437/interspeech.2022-10815","DOIUrl":"https://doi.org/10.21437/interspeech.2022-10815","url":null,"abstract":"This study aims to identify laryngeal manipulations that would allow a male to approximate a female-sounding voice, and that can be targeted in voice feminization surgery or therapy. Synthetic voices were generated using a three-dimensional vocal fold model with parametric variations in vocal fold geometry, stiffness, adduction, and subglottal pressure. The vocal tract was kept constant in order to focus on the contribution of laryngeal manipulations. Listening subjects were asked to judge if a voice sounded male or female, or if they were unsure. Results showed the expected large effect of the fundamental frequency (F0) and a moderate effect of spectral shape on gender perception. A mismatch between F0 and spectral shape cues (e.g., low F0 paired with high H1-H2) contributed to ambiguity in gender perception, particularly for voices with F0 in the intermediate range between those of typical adult males and females. Physiologically, the results showed that a female-sounding voice can be produced by decreasing vocal fold thickness and increasing vocal fold transverse stiffness in the coronal plane, changes in which modified both F0 and spectral shape. In contrast, laryngeal manipulations with limited impact on F0 or spectral shape were shown to be less effective in modifying gender perception.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"1856-1860"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43054390","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Incremental learning for RNN-Transducer based speech recognition models
Interspeech Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-10795
Deepak Baby, Pasquale D’Alterio, Valentin Mendelev
{"title":"Incremental learning for RNN-Transducer based speech recognition models","authors":"Deepak Baby, Pasquale D’Alterio, Valentin Mendelev","doi":"10.21437/interspeech.2022-10795","DOIUrl":"https://doi.org/10.21437/interspeech.2022-10795","url":null,"abstract":"This paper investigates an incremental learning framework for a real-world voice assistant employing RNN-Transducer based automatic speech recognition (ASR) model. Such a model needs to be regularly updated to keep up with changing distribution of customer requests. We demonstrate that a simple fine-tuning approach with a combination of old and new training data can be used to incrementally update the model spending only several hours of training time and without any degradation on old data. This paper explores multiple rounds of incremental updates on the ASR model with monthly training data. Results show that the proposed approach achieves 5-6% relative WER improvement over the models trained from scratch on the monthly evaluation datasets. In addition, we explore if it is possible to improve recognition of specific new words. We simulate multiple rounds of incremental updates with handful of training utterances per word (both real and synthetic) and show that the recognition of the new words improves dramatically but with a minor degradation on general data. Finally, we demonstrate that the observed degradation on general data can be mitigated by interleaving monthly updates with updates targeting specific words.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"71-75"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47633462","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
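The abstract above describes fine-tuning on "a combination of old and new training data" but does not say how the two pools are combined. The sketch below shows one plausible way to build such a mixed fine-tuning set, purely as an illustration. The function `mix_old_and_new`, the `old_fraction` and `size` parameters, and the toy utterance IDs are all assumptions introduced here, not details from the paper.

```python
import random


def mix_old_and_new(old_data, new_data, old_fraction=0.5, size=None, seed=0):
    """Sample a fine-tuning set in which about `old_fraction` of the items are old data.

    `old_fraction` and `size` are illustrative knobs, not values from the paper.
    """
    if not 0.0 <= old_fraction <= 1.0:
        raise ValueError("old_fraction must be in [0, 1]")
    rng = random.Random(seed)
    size = len(new_data) if size is None else size
    n_old = round(size * old_fraction)
    n_new = size - n_old
    # Sample with replacement so either pool may be smaller than its quota.
    mixed = rng.choices(old_data, k=n_old) + rng.choices(new_data, k=n_new)
    rng.shuffle(mixed)
    return mixed


# Toy stand-ins for utterance IDs from previous months and from the new month.
old = [f"old_utt_{i}" for i in range(1000)]
new = [f"new_utt_{i}" for i in range(200)]

batch = mix_old_and_new(old, new, old_fraction=0.5, size=100)
n_old_in_batch = sum(u.startswith("old_") for u in batch)
print(f"{n_old_in_batch} old and {len(batch) - n_old_in_batch} new utterances")
```

Keeping some old data in every update is a common way to limit forgetting during fine-tuning; the actual ratio used by the authors would have to be taken from the paper itself.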
Adversarial-Free Speaker Identity-Invariant Representation Learning for Automatic Dysarthric Speech Classification
Interspeech Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-402
Parvaneh Janbakhshi, I. Kodrasi
{"title":"Adversarial-Free Speaker Identity-Invariant Representation Learning for Automatic Dysarthric Speech Classification","authors":"Parvaneh Janbakhshi, I. Kodrasi","doi":"10.21437/interspeech.2022-402","DOIUrl":"https://doi.org/10.21437/interspeech.2022-402","url":null,"abstract":"Speech representations which are robust to pathology-unrelated cues such as speaker identity information have been shown to be advantageous for automatic dysarthric speech classification. A recently proposed technique to learn speaker identity-invariant representations for dysarthric speech classification is based on adversarial training. However, adversarial training can be challenging, unstable, and sensitive to training parameters. To avoid adversarial training, in this paper we propose to learn speaker-identity invariant representations exploiting a feature separation framework relying on mutual information minimization. Experimental results on a database of neurotypical and dysarthric speech show that the proposed adversarial-free framework successfully learns speaker identity-invariant representations. Further, it is shown that such representations result in a similar dysarthric speech classification performance as the representations obtained using adversarial training, while the training procedure is more stable and less sensitive to training parameters.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"2138-2142"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48272141","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Knowledge distillation for In-memory keyword spotting model
Interspeech Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-633
Zeyang Song, Qi Liu, Qu Yang, Haizhou Li
{"title":"Knowledge distillation for In-memory keyword spotting model","authors":"Zeyang Song, Qi Liu, Qu Yang, Haizhou Li","doi":"10.21437/interspeech.2022-633","DOIUrl":"https://doi.org/10.21437/interspeech.2022-633","url":null,"abstract":"We study a light-weight implementation of keyword spotting (KWS) for voice command and control, that can be implemented on an in-memory computing (IMC) unit with same accuracy at a lower computational cost than the state-of-the-art methods. KWS is expected to be always-on for mobile devices with limited resources. IMC represents one of the solutions. However, it only supports multiplication-accumulation and Boolean operations. We note that common feature extraction methods, such as MFCC and SincConv, are not supported by IMC as they depend on expensive logarithm computing. On the other hand, some neural network solutions to KWS involve a large number of parameters that are not feasible for mobile devices. In this work, we propose a knowledge distillation technique to replace the complex speech frontend like MFCC or SincConv with a light-weight encoder without performance loss. Experiments show that the proposed model outperforms the KWS model with MFCC and SincConv front-end in terms of accuracy and computational cost.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"4128-4132"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48292021","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1