Interspeech. Pub Date: 2022-09-18. DOI: 10.21437/interspeech.2022-10097
"ASR-Robust Natural Language Understanding on ASR-GLUE dataset"
Lingyun Feng, Jianwei Yu, Yan Wang, Songxiang Liu, Deng Cai, Haitao Zheng
Interspeech. Pub Date: 2022-09-18. DOI: 10.21437/interspeech.2022-291
"A Complementary Joint Training Approach Using Unpaired Speech and Text"
Ye Du, J. Zhang, Qiu-shi Zhu, Lirong Dai, Ming Wu, Xin Fang, Zhouwang Yang
Interspeech. Pub Date: 2022-09-18. DOI: 10.21437/interspeech.2022-10543
"Effects of Noise on Speech Perception and Spoken Word Comprehension"
Jovan Eranovic, D. Pape, M. Stroińska, E. Service, Marijana Matkovski
Abstract: The aim of the study was to determine which of three categories of masking noise (energetic: masking portions of the target speech with its energy; informational: target and masker competing for the listener's attention; degraded: reverberated or filtered speech) is most detrimental to speech perception and spoken word comprehension. To that end, participants completed three tasks with and without added noise: listening span, listening comprehension, and shadowing. Shadowing is considered primarily a speech-perception task, while the other two are considered to rely on word comprehension and semantic inference. The study found informational masking to be most detrimental to speech perception, while energetic masking and sound degradation were most detrimental to spoken word comprehension. The results also imply that masking categories must be used with caution, since not all maskers belonging to one category had the same effect on performance.
Interspeech. Pub Date: 2022-09-18. DOI: 10.21437/interspeech.2022-10927
"State & Trait Measurement from Nonverbal Vocalizations: A Multi-Task Joint Learning Approach"
Alice Baird, Panagiotis Tzirakis, Jeff Brooks, Lauren Kim, Michael Opara, Christopher B. Gregory, Jacob Metrick, Garrett Boseck, D. Keltner, Alan S. Cowen
Abstract: Humans infer a wide array of meanings from expressive nonverbal vocalizations, e.g., laughs, cries, and sighs. Thus far, computational research has primarily focused on the coarse classification of vocalizations such as laughs, but that approach overlooks significant variations in the meaning of distinct laughs (e.g., amusement, awkwardness, triumph) and the rich array of more nuanced vocalizations people produce. Nonverbal vocalizations are shaped by the emotional state an individual chooses to convey, their wellbeing, and (as with the voice more broadly) their identity-related traits. In the present study, we utilize a large-scale dataset comprising more than 35 hours of densely labeled vocal bursts to model emotionally expressive states and demographic traits from nonverbal vocalizations. We compare single-task and multi-task deep learning architectures to explore how models can leverage acoustic co-dependencies that may exist between the expression of 10 emotions in vocal bursts and the demographic traits of the speaker. Results show that nonverbal vocalizations can be reliably leveraged to predict emotional expression, age, and country of origin. In the multi-task setting, our experiments show that joint learning of emotional expression and demographic traits appears to yield robust results, benefiting primarily the classification of a speaker's country of origin.
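As a rough illustration of the single- versus multi-task framing described in this abstract, the sketch below pushes a shared utterance embedding through separate emotion, age, and country heads. Every dimension, weight, and head choice here is an invented assumption for illustration, not the authors' architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Shared encoder output: one embedding per vocal burst (sizes are illustrative).
n_samples, emb_dim = 4, 32
shared = rng.standard_normal((n_samples, emb_dim))

# Task-specific heads on top of the shared representation.
W_emotion = rng.standard_normal((emb_dim, 10))  # 10 emotion categories
W_age     = rng.standard_normal((emb_dim, 1))   # scalar age regression
W_country = rng.standard_normal((emb_dim, 4))   # hypothetical 4 countries

emotion_pred = softmax(shared @ W_emotion)      # distribution over emotions
age_pred     = shared @ W_age                   # unbounded regression output
country_prob = softmax(shared @ W_country)      # distribution over countries

# Multi-task training would sum per-task losses, e.g.
# L = L_emotion + L_age + L_country, backpropagated through the shared encoder,
# so acoustic cues shared across tasks shape one common representation.
```
The point of the joint setup is that gradients from all three heads update the same shared encoder, which is how co-dependencies between expression and traits can be exploited.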
Interspeech. Pub Date: 2022-09-18. DOI: 10.21437/interspeech.2022-10388
"Analysis of expressivity transfer in non-autoregressive end-to-end multispeaker TTS systems"
Ajinkya Kulkarni, Vincent Colotte, D. Jouvet
Abstract: The main objective of this work is to study, in non-autoregressive end-to-end TTS systems, expressivity transfer to a speaker's voice for which no expressive speech data is available. We investigated the expressivity transfer capability of probability density estimation based on deep generative models, namely Generative Flow (Glow) and diffusion probabilistic models (DPM). Deep generative models provide better log-likelihood estimates and tractability, and consequently high-quality speech synthesis with faster inference. Furthermore, we propose various expressivity encoders that assist expressivity transfer in the text-to-speech (TTS) system; more precisely, we used self-attention statistical pooling and multi-scale expressivity encoder architectures to create a meaningful representation of expressivity. In addition to the traditional subjective metrics used for speech synthesis evaluation, we incorporated cosine similarity to measure the strength of the attributes associated with speaker and expressivity. A non-autoregressive TTS system with a multi-scale expressivity encoder showed better expressivity transfer with both Glow- and DPM-based decoders, illustrating the ability of the multi-scale architecture to capture the underlying attributes of expressivity from multiple acoustic features.
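The cosine-similarity measure mentioned in this abstract is straightforward to compute between two embedding vectors; the sketch below uses invented vector values purely for illustration.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors: 1 = same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical speaker embeddings: one from synthesized speech, one reference.
synth_speaker = np.array([0.2, 0.9, -0.3])
ref_speaker   = np.array([0.25, 0.8, -0.2])

similarity = cosine_similarity(synth_speaker, ref_speaker)
```
A value near 1 would indicate the synthesized voice retains the reference speaker's (or expressivity) attributes; the same computation applies to expressivity embeddings.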
Interspeech. Pub Date: 2022-09-18. DOI: 10.21437/interspeech.2022-10782
"A blueprint for using deepfakes in sociolinguistic matched-guise experiments"
N. Young, D. Britain, A. Leemann
Abstract: Matched-guise paradigms, which are used extensively in speaker and accent evaluation studies, have long been hampered by empirical holes. We offer a solution by incorporating deepfake technology, which greatly reduces the number of potential confounds. We constructed a sociophonetic experiment whereby high-rising terminal (a.k.a. "uptalk"), and the lack thereof, was superimposed onto a deepfaked "beautiful" and "less beautiful" female guise. The resulting four guises were incorporated into a 2x2-factor between-subjects experiment tested on female evaluators. Each evaluator assessed their respective guise against a list of prescribed attributes and offered free-form comments. Results align with studies on high-rising terminal as well as with intuitions concerning conventional beauty, which validates the technique and motivates its wider adoption.
Interspeech. Pub Date: 2022-09-18. DOI: 10.21437/interspeech.2022-10106
"End-to-End Audio-Visual Neural Speaker Diarization"
Maokui He, Jun Du, Chin-Hui Lee
Abstract: In this paper, we propose a novel end-to-end neural-network-based audio-visual speaker diarization method. Unlike most existing audio-visual methods, our model takes audio features (e.g., FBANKs), multi-speaker lip regions of interest (ROIs), and multi-speaker i-vector embeddings as multi-modal inputs, and a set of binary classification output layers produces the activity of each speaker. With this finely designed end-to-end structure, the proposed method can explicitly handle overlapping speech and accurately distinguish between speech and non-speech using multi-modal information. I-vectors are key to solving the alignment problem caused by visual-modality errors (e.g., occlusions, off-screen speakers, or unreliable detection). Moreover, our audio-visual model is robust to the absence of the visual modality, a condition under which a visual-only model degrades significantly. Evaluated on the datasets of the first Multi-modal Information based Speech Processing (MISP) challenge, the proposed method achieved diarization error rates (DERs) of 10.1%/9.5% on the development/evaluation sets with reference voice activity detection (VAD) information, while the audio-only and video-only systems yielded DERs of 27.9%/29.0% and 14.6%/13.1%, respectively.
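For readers unfamiliar with the DER figures quoted above, a simplified frame-level version can be sketched as follows. Real DER scoring (with a forgiveness collar and an optimal reference-to-hypothesis speaker mapping) is more involved; this sketch assumes the labels are already aligned and uses invented per-frame labels.

```python
import numpy as np

def frame_der(ref, hyp):
    """Frame-level diarization error rate (simplified).

    ref, hyp: per-frame speaker labels, 0 = silence.
    DER = (missed speech + false alarm + speaker confusion) / reference speech.
    """
    ref, hyp = np.asarray(ref), np.asarray(hyp)
    missed      = np.sum((ref != 0) & (hyp == 0))   # speech marked as silence
    false_alarm = np.sum((ref == 0) & (hyp != 0))   # silence marked as speech
    confusion   = np.sum((ref != 0) & (hyp != 0) & (ref != hyp))
    speech      = np.sum(ref != 0)                  # total reference speech
    return (missed + false_alarm + confusion) / speech

ref = [0, 1, 1, 1, 2, 2, 0, 0]
hyp = [0, 1, 1, 2, 2, 2, 2, 0]
# 1 confusion frame + 1 false-alarm frame over 5 reference speech frames.
der = frame_der(ref, hyp)  # 0.4
```
Note that, as in the abstract's comparison, the same reference labels can score audio-only, video-only, and audio-visual hypotheses on equal footing.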
Interspeech. Pub Date: 2022-09-18. DOI: 10.21437/interspeech.2022-759
"Speaking Rate Control of end-to-end TTS Models by Direct Manipulation of the Encoder's Output Embeddings"
Martin Lenglet, O. Perrotin, G. Bailly
Abstract: Since neural text-to-speech models now achieve such high standards of naturalness, the focus of the field has gradually shifted to gaining more control over the expressiveness of synthetic voices. One such lever is the speaking rate, which has become harder for a human operator to control since the introduction of neural attention networks to model speech dynamics. While numerous models have reintroduced explicit duration control (e.g., FastSpeech2), they generally rely on additional tasks during training. In this paper, we show that an acoustic analysis of the internal embeddings delivered by the encoder of an unsupervised end-to-end Tacotron2 TTS model is enough to identify and control acoustic parameters of interest. Specifically, we compare this speaking rate control with the duration control offered by a supervised FastSpeech2 model. Experimental results show that the control provided by the embeddings reproduces a behaviour closer to natural speech data.
Interspeech. Pub Date: 2022-09-18. DOI: 10.21437/interspeech.2022-387
"Automatic Detection of Reactive Attachment Disorder Through Turn-Taking Analysis in Clinical Child-Caregiver Sessions"
Andréi Birladeanu, H. Minnis, A. Vinciarelli
Abstract: To the best of our knowledge, this is the first work aimed at the automatic detection of Reactive Attachment Disorder, a psychiatric condition typically affecting children who have experienced abuse and neglect. The proposed approach is based on the analysis of turn-taking during clinical sessions, and the experiments involved 61 children and their caregivers. The results show that it is possible to detect the pathology with accuracy up to 69.2% (F1 score 68.8%). In addition, the experiments show that the pathology tends to leave different behavioral traces in different activities, which might explain why Reactive Attachment Disorder is difficult to diagnose and tends to remain undetected. In such a context, methodologies like the one proposed in this work can be a valuable support in clinical practice.
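Turn-taking analysis of the kind described in this abstract typically starts from simple statistics over speaker turns. The sketch below computes speaker switches and inter-turn gaps or overlaps from hypothetical (speaker, start, end) tuples; it illustrates the general idea, not the authors' actual feature set.

```python
def turn_taking_features(turns):
    """Simple turn-taking statistics from (speaker, start_sec, end_sec) tuples.

    Returns the number of speaker switches and, for each pair of consecutive
    turns, the gap (positive) or overlap (negative) duration in seconds.
    """
    turns = sorted(turns, key=lambda t: t[1])  # order by start time
    switches, gaps = 0, []
    for (spk_a, _, end_a), (spk_b, start_b, _) in zip(turns, turns[1:]):
        if spk_b != spk_a:
            switches += 1
        gaps.append(start_b - end_a)
    return switches, gaps

# Invented toy session: a 0.5 s gap, then a 0.5 s overlap (interruption).
turns = [("child", 0.0, 2.0), ("caregiver", 2.5, 5.0), ("child", 4.5, 6.0)]
switches, gaps = turn_taking_features(turns)  # 2, [0.5, -0.5]
```
Distributions of such gap, overlap, and switch statistics per activity are the kind of behavioral trace a classifier could then be trained on.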
Interspeech. Pub Date: 2022-09-18. DOI: 10.21437/interspeech.2022-355
"Ant Multilingual Recognition System for OLR 2021 Challenge"
Anqi Lyu, Zhiming Wang, Huijia Zhu
Abstract: This paper presents a comprehensive description of the Ant multilingual recognition system for the 6th Oriental Language Recognition (OLR 2021) Challenge. Inspired by transfer learning, the encoder of the language identification (LID) model is initialized from a pretrained automatic speech recognition (ASR) network to integrate lexical phonetic information into language identification. The ASR model is an encoder-decoder network based on the U2++ architecture [1]. The LID model inherits the shared Conformer encoder [2] from the pretrained ASR model, which is effective at capturing global information and modeling local invariance; an attentive statistical pooling layer and a subsequent linear projection layer are added on top of the encoder, and the model is then fine-tuned to its optimum. Furthermore, data augmentation, score normalization, and model ensembling are investigated and analysed in detail as strategies to improve performance. In the OLR 2021 Challenge, our submitted systems ranked first in both Task 1 and Task 2, with primary metrics of 0.0025 and 0.0039 respectively, less than one third of the second-place scores, which illustrates that our methodologies for multilingual identification are effective and competitive in real-life scenarios.
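The attentive statistical pooling layer mentioned in this abstract can be sketched roughly as follows. For simplicity the frame scorer here is a single random vector rather than the small learned network such layers normally use, and all sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def attentive_stat_pooling(frames, w):
    """Attentive statistical pooling over a (T, D) frame sequence.

    A scoring vector w rates each frame; the resulting attention weights
    form a weighted mean and standard deviation, concatenated into a
    fixed-size 2*D utterance-level vector regardless of T.
    """
    scores = frames @ w                            # (T,) per-frame relevance
    alpha = softmax(scores)                        # attention weights, sum to 1
    mean = (alpha[:, None] * frames).sum(axis=0)   # weighted mean, (D,)
    var = (alpha[:, None] * (frames - mean) ** 2).sum(axis=0)
    std = np.sqrt(np.maximum(var, 1e-9))           # weighted std, (D,)
    return np.concatenate([mean, std])             # (2*D,)

T, D = 50, 16                                      # illustrative sizes
frames = rng.standard_normal((T, D))               # stand-in encoder outputs
w = rng.standard_normal(D)                         # stand-in learned scorer
utt = attentive_stat_pooling(frames, w)            # fixed-size utterance vector
```
The fixed-size output is what lets a linear projection layer on top produce language posteriors from variable-length utterances.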