Interspeech最新文献_第2页

Cooperative Speech Separation With a Microphone Array and Asynchronous Wearable Devices 基于麦克风阵列和异步可穿戴设备的协同语音分离

Interspeech Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-11025

R. Corey, Manan Mittal, Kanad Sarkar, A. Singer

{"title":"Cooperative Speech Separation With a Microphone Array and Asynchronous Wearable Devices","authors":"R. Corey, Manan Mittal, Kanad Sarkar, A. Singer","doi":"10.21437/interspeech.2022-11025","DOIUrl":"https://doi.org/10.21437/interspeech.2022-11025","url":null,"abstract":"We consider the problem of separating speech from several talkers in background noise using a fixed microphone array and a set of wearable devices. Wearable devices can provide reliable information about speech from their wearers, but they typically cannot be used directly for multichannel source separation due to network delay, sample rate offsets, and relative motion. Instead, the wearable microphone signals are used to compute the speech presence probability for each talker at each time-frequency index. Those parameters, which are robust against small sample rate offsets and relative motion, are used to track the second-order statistics of the speech sources and background noise. The fixed array then separates the speech signals using an adaptive linear time-varying multichannel Wiener filter. The proposed method is demonstrated using real-room recordings from three human talkers with binaural earbud microphones and an eight-microphone tabletop array. but are useful for distin-guishing between different sources because of their known positions relative to the talkers. The proposed system uses the wearable devices to estimate SPP values, which are then used to learn the second-order statistics for each source at the microphones of the fixed array. The array separates the sources using an adaptive linear time-varying spatial filter suitable for real-time applications. This work combines the cooperative ar-chitecture of [19], the distributed SPP method of [18], and the motion-robust modeling of [15]. The system is implemented adaptively and demonstrated using live human talkers.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"5398-5402"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45171254","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

An Automatic Soundtracking System for Text-to-Speech Audiobooks 一种用于文本到语音有声读物的自动声音跟踪系统

Interspeech Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-10236

Zikai Chen, Lin Wu, Junjie Pan, Xiang Yin

引用次数: 0

Empirical Sampling from Latent Utterance-wise Evidence Model for Missing Data ASR based on Neural Encoder-Decoder Model 基于神经编码器-解码器模型的基于潜在话语证据的缺失数据ASR经验抽样

Interspeech Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-576

Ryu Takeda, Yui Sudo, K. Nakadai, Kazunori Komatani

{"title":"Empirical Sampling from Latent Utterance-wise Evidence Model for Missing Data ASR based on Neural Encoder-Decoder Model","authors":"Ryu Takeda, Yui Sudo, K. Nakadai, Kazunori Komatani","doi":"10.21437/interspeech.2022-576","DOIUrl":"https://doi.org/10.21437/interspeech.2022-576","url":null,"abstract":"Missing data automatic speech recognition (MD-ASR) can utilize the uncertainty of speech enhancement (SE) results without re-training of model parameters. Such uncertainty is represented by a probabilistic evidence model, and the design and the expectation calculation of it are important. Two problems arise in applying the MD approach to utterance-wise ASR based on neural encoder-decoder model: the high-dimensionality of an utterance-wise evidence model and the discontinuity among frames of generated samples in approximating the expectation with Monte-Carlo method. We propose new utterance-wise evidence models using a latent variable and an empirical method for sampling from them. The space of our latent model is restricted by simpler conditional probability density functions (pdfs) given the latent variable, which enables us to generate samples from the low-dimensional space in deterministic or stochastic way. Because the variable also works as a common smoothing parameter among simple pdfs, the generated samples are continuous among frames, which improves the ASR performance unlike frame-wise models. The uncertainty from a neural SE is also used as a component in our mixture pdf models. Experiments showed that the character error rate of the enhanced speech was further improved by 2.5 points on average with our MD-ASR using transformer model.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"3789-3793"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45261354","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 1

Isochronous is beautiful? Syllabic event detection in a neuro-inspired oscillatory model is facilitated by isochrony in speech 等时是美丽的吗？语音中的等时性有助于神经振荡模型中的音节事件检测

Interspeech Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-10426

Mamady Nabe, J. Diard, J. Schwartz

{"title":"Isochronous is beautiful? Syllabic event detection in a neuro-inspired oscillatory model is facilitated by isochrony in speech","authors":"Mamady Nabe, J. Diard, J. Schwartz","doi":"10.21437/interspeech.2022-10426","DOIUrl":"https://doi.org/10.21437/interspeech.2022-10426","url":null,"abstract":"Oscillation-based neuro-computational models of speech perception are grounded in the capacity of human brain oscillations to track the speech signal. Consequently, one would expect this tracking to be more efficient for more regular signals. In this pa-per, we address the question of the contribution of isochrony to event detection by neuro-computational models of speech perception. We consider a simple model of event detection proposed in the literature, based on oscillatory processes driven by the acoustic envelope, that was previously shown to efficiently detect syllabic events in various languages. We first evaluate its performance in the detection of syllabic events for French, and show that “perceptual centers” associated to vowel onsets are more robustly detected than syllable onsets. Then we show that isochrony in natural speech improves the performance of event detection in the oscillatory model. We also evaluate the model’s robustness to acoustic noise. Overall, these results show the importance of bottom-up resonance mechanism for event detection; however, they suggest that bottom-up processing of acoustic envelope is not able to perfectly detect events relevant to speech temporal segmentation, highlighting the potential and complementary role of top-down, predictive knowledge.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"4671-4675"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45456275","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 1

Mandarin Lombard Grid: a Lombard-grid-like corpus of Standard Chinese 普通话伦巴第格：一个类似伦巴第格的标准汉语语料库

Interspeech Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-854

Yuhong Yang, Xufeng Chen, Qingmu Liu, Weiping Tu, Hongyang Chen, Linjun Cai

引用次数: 3

The Prosody of Cheering in Sport Events 体育赛事中欢呼的韵律

Interspeech Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-10982

Marzena Żygis, Sarah Wesołek, Nina Hosseini-Kivanani, M. Krifka

{"title":"The Prosody of Cheering in Sport Events","authors":"Marzena Żygis, Sarah Wesołek, Nina Hosseini-Kivanani, M. Krifka","doi":"10.21437/interspeech.2022-10982","DOIUrl":"https://doi.org/10.21437/interspeech.2022-10982","url":null,"abstract":"Motivational speaking usually conveys a highly emotional message and its purpose is to invite action. The goal of this paper is to investigate the prosodic realization of one particular type of cheering, namely inciting cheering for single addressees in sport events (here, long-distance running), using the name of that person. 31 native speakers of German took part in the experiment. They were asked to cheer up an individual marathon runner in a sporting event represented by video by producing his or her name (1-5 syllables long). For reasons of comparison, the participants also produced the same names in isolation and carrier sentences. Our results reveal that speakers use different strategies to meet their motivational communicative goals: while some speakers produced the runners’ names by dividing them into syllables, others pronounced the names as quickly as possible putting more emphasis on the first syllable. A few speakers followed a mixed strategy. Contrary to our expectations, it was not the intensity that mostly contributes to the differences between the different speaking styles (cheering vs. neutral), at least in the methods we were using. Rather, participants employed higher fundamental frequency and longer duration when cheering for marathon runners.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"5283-5287"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43143290","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Acoustic Stress Detection in Isolated English Words for Computer-Assisted Pronunciation Training 计算机辅助发音训练中孤立英语单词的声重音检测

Interspeech Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-197

Vera Bernhard, Sandra Schwab, J. Goldman

{"title":"Acoustic Stress Detection in Isolated English Words for Computer-Assisted Pronunciation Training","authors":"Vera Bernhard, Sandra Schwab, J. Goldman","doi":"10.21437/interspeech.2022-197","DOIUrl":"https://doi.org/10.21437/interspeech.2022-197","url":null,"abstract":"We propose a system for automatic lexical stress detection in isolated English words. It is designed to be part of the computer-assisted pronunciation training application MIAPARLE (“https://miaparle.unige.ch”) that specifically focuses on stress contrasts acquisition. Training lexical stress cannot be disregarded in language education as the accuracy in production highly affects the intelligibility and perceived fluency of an L2 speaker. The pipeline automatically segments audio input into syllables over which duration, intensity, pitch, and spectral information is calculated. Since the stress of a syllable is defined relative to its neighboring syllables, the values obtained over the syllables are complemented with differential values to the preceding and following syllables. The resulting feature vectors, retrieved from 1011 recordings of single words spoken by English natives, are used to train a Voting Classifier composed of four supervised classifiers, namely a Support Vector Machine, a Neural Net, a K Nearest Neighbor, and a Random Forest classifier. The approach determines syllables of a single word as stressed or unstressed with an F1 score of 94% and an accuracy of 96%.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"3143-3147"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49006649","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Non-intrusive Speech Quality Assessment with a Multi-Task Learning based Subband Adaptive Attention Temporal Convolutional Neural Network 基于多任务学习的子带自适应注意时间卷积神经网络非侵入性语音质量评价

Interspeech Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-10315

Xiaofeng Shu, Yanjie Chen, Chuxiang Shang, Yan Zhao, Chengshuai Zhao, Yehang Zhu, Chuanzeng Huang, Yuxuan Wang

{"title":"Non-intrusive Speech Quality Assessment with a Multi-Task Learning based Subband Adaptive Attention Temporal Convolutional Neural Network","authors":"Xiaofeng Shu, Yanjie Chen, Chuxiang Shang, Yan Zhao, Chengshuai Zhao, Yehang Zhu, Chuanzeng Huang, Yuxuan Wang","doi":"10.21437/interspeech.2022-10315","DOIUrl":"https://doi.org/10.21437/interspeech.2022-10315","url":null,"abstract":"In terms of subjective evaluations, speech quality has been gen-erally described by a mean opinion score (MOS). In recent years, non-intrusive speech quality assessment shows an active progress by leveraging deep learning techniques. In this paper, we propose a new multi-task learning based model, termed as subband adaptive attention temporal convolutional neural network (SAA-TCN), to perform non-intrusive speech quality assessment with the help of MOS value interval detector (VID) auxiliary task. Instead of using fullband magnitude spectrogram, the proposed model takes subband magnitude spectrogram as the input to reduce model parameters and prevent overﬁtting. To effectively utilize the energy distribution information along the subband frequency dimension, subband adaptive attention (SAA) is employed to enhance the TCN model. Experimental results reveal that the proposed method achieves a superior performance on predicting the MOS values. In ConferencingSpeech 2022 Challenge, our method achieves a mean Pearson’s correlation coefﬁcient (PCC) score of 0.763 and outperforms the challenge baseline method by 0.233.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"3298-3302"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49153770","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 1

On Breathing Pattern Information in Synthetic Speech 合成语音中的呼吸模式信息

Interspeech Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-10271

Z. Mostaani, M. Magimai.-Doss

引用次数: 2

Cross-modal Transfer Learning via Multi-grained Alignment for End-to-End Spoken Language Understanding 基于多粒度对齐的跨模态迁移学习用于端到端口语理解

Interspeech Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-11378

Yi Zhu, Zexun Wang, Hang Liu, Pei-Hsin Wang, Mingchao Feng, Meng Chen, Xiaodong He

{"title":"Cross-modal Transfer Learning via Multi-grained Alignment for End-to-End Spoken Language Understanding","authors":"Yi Zhu, Zexun Wang, Hang Liu, Pei-Hsin Wang, Mingchao Feng, Meng Chen, Xiaodong He","doi":"10.21437/interspeech.2022-11378","DOIUrl":"https://doi.org/10.21437/interspeech.2022-11378","url":null,"abstract":"End-to-end spoken language understanding (E2E-SLU) has witnessed impressive improvements through cross-modal (text-to-audio) transfer learning. However, current methods mostly focus on coarse-grained sequence-level text-to-audio knowledge transfer with simple loss, and neglecting the fine-grained temporal alignment between the two modalities. In this work, we propose a novel multi-grained cross-modal transfer learning framework for E2E-SLU. Specifically, we devise a cross attention module to align the tokens of text with the frame features of speech, encouraging the model to target at the salient acoustic features attended to each token during transferring the semantic information. We also leverage contrastive learning to facilitate cross-modal representation learning in sentence level. Finally, we explore various data augmentation methods to mitigate the deficiency of large amount of labelled data for the training of E2E-SLU. Extensive experiments are conducted on both English and Chinese SLU datasets to verify the effectiveness of our proposed approach. Experimental results and detailed analyses demonstrate the superiority and competitiveness of our model.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"1131-1135"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44733411","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 5