InterspeechPub Date : 2022-09-18DOI: 10.21437/interspeech.2022-10236
Zikai Chen, Lin Wu, Junjie Pan, Xiang Yin
{"title":"An Automatic Soundtracking System for Text-to-Speech Audiobooks","authors":"Zikai Chen, Lin Wu, Junjie Pan, Xiang Yin","doi":"10.21437/interspeech.2022-10236","DOIUrl":"https://doi.org/10.21437/interspeech.2022-10236","url":null,"abstract":"Background music (BGM) plays an essential role in audiobooks, which can enhance the immersive experience of audiences and help them better understand the story. However, welldesigned BGM still requires human effort in the text-to-speech (TTS) audiobook production, which is quite time-consuming and costly. In this paper, we introduce an automatic soundtracking system for TTS-based audiobooks. The proposed system divides the soundtracking process into three tasks: plot partition, plot classification, and music selection. The experiments shows that both our plot partition module and plot classification module outperform baselines by a large margin. Furthermore, TTS-based audiobooks produced with our proposed automatic soundtracking system achieves comparable performance to that produced with the human soundtracking system. To our best of knowledge, this is the first work of automatic soundtracking system for audiobooks. Demos are available on https: //acst1223.github.io/interspeech2022/main.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"476-480"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45188982","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
InterspeechPub Date : 2022-09-18DOI: 10.21437/interspeech.2022-576
Ryu Takeda, Yui Sudo, K. Nakadai, Kazunori Komatani
{"title":"Empirical Sampling from Latent Utterance-wise Evidence Model for Missing Data ASR based on Neural Encoder-Decoder Model","authors":"Ryu Takeda, Yui Sudo, K. Nakadai, Kazunori Komatani","doi":"10.21437/interspeech.2022-576","DOIUrl":"https://doi.org/10.21437/interspeech.2022-576","url":null,"abstract":"Missing data automatic speech recognition (MD-ASR) can utilize the uncertainty of speech enhancement (SE) results without re-training of model parameters. Such uncertainty is represented by a probabilistic evidence model, and the design and the expectation calculation of it are important. Two problems arise in applying the MD approach to utterance-wise ASR based on neural encoder-decoder model: the high-dimensionality of an utterance-wise evidence model and the discontinuity among frames of generated samples in approximating the expectation with Monte-Carlo method. We propose new utterance-wise evidence models using a latent variable and an empirical method for sampling from them. The space of our latent model is restricted by simpler conditional probability density functions (pdfs) given the latent variable, which enables us to generate samples from the low-dimensional space in deterministic or stochastic way. Because the variable also works as a common smoothing parameter among simple pdfs, the generated samples are continuous among frames, which improves the ASR performance unlike frame-wise models. The uncertainty from a neural SE is also used as a component in our mixture pdf models. Experiments showed that the character error rate of the enhanced speech was further improved by 2.5 points on average with our MD-ASR using transformer model.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"3789-3793"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45261354","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
InterspeechPub Date : 2022-09-18DOI: 10.21437/interspeech.2022-10426
Mamady Nabe, J. Diard, J. Schwartz
{"title":"Isochronous is beautiful? Syllabic event detection in a neuro-inspired oscillatory model is facilitated by isochrony in speech","authors":"Mamady Nabe, J. Diard, J. Schwartz","doi":"10.21437/interspeech.2022-10426","DOIUrl":"https://doi.org/10.21437/interspeech.2022-10426","url":null,"abstract":"Oscillation-based neuro-computational models of speech perception are grounded in the capacity of human brain oscillations to track the speech signal. Consequently, one would expect this tracking to be more efficient for more regular signals. In this pa-per, we address the question of the contribution of isochrony to event detection by neuro-computational models of speech perception. We consider a simple model of event detection proposed in the literature, based on oscillatory processes driven by the acoustic envelope, that was previously shown to efficiently detect syllabic events in various languages. We first evaluate its performance in the detection of syllabic events for French, and show that “perceptual centers” associated to vowel onsets are more robustly detected than syllable onsets. Then we show that isochrony in natural speech improves the performance of event detection in the oscillatory model. We also evaluate the model’s robustness to acoustic noise. Overall, these results show the importance of bottom-up resonance mechanism for event detection; however, they suggest that bottom-up processing of acoustic envelope is not able to perfectly detect events relevant to speech temporal segmentation, highlighting the potential and complementary role of top-down, predictive knowledge.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"4671-4675"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45456275","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
InterspeechPub Date : 2022-09-18DOI: 10.21437/interspeech.2022-10982
Marzena Żygis, Sarah Wesołek, Nina Hosseini-Kivanani, M. Krifka
{"title":"The Prosody of Cheering in Sport Events","authors":"Marzena Żygis, Sarah Wesołek, Nina Hosseini-Kivanani, M. Krifka","doi":"10.21437/interspeech.2022-10982","DOIUrl":"https://doi.org/10.21437/interspeech.2022-10982","url":null,"abstract":"Motivational speaking usually conveys a highly emotional message and its purpose is to invite action. The goal of this paper is to investigate the prosodic realization of one particular type of cheering, namely inciting cheering for single addressees in sport events (here, long-distance running), using the name of that person. 31 native speakers of German took part in the experiment. They were asked to cheer up an individual marathon runner in a sporting event represented by video by producing his or her name (1-5 syllables long). For reasons of comparison, the participants also produced the same names in isolation and carrier sentences. Our results reveal that speakers use different strategies to meet their motivational communicative goals: while some speakers produced the runners’ names by dividing them into syllables, others pronounced the names as quickly as possible putting more emphasis on the first syllable. A few speakers followed a mixed strategy. Contrary to our expectations, it was not the intensity that mostly contributes to the differences between the different speaking styles (cheering vs. neutral), at least in the methods we were using. Rather, participants employed higher fundamental frequency and longer duration when cheering for marathon runners.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"5283-5287"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43143290","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
InterspeechPub Date : 2022-09-18DOI: 10.21437/interspeech.2022-197
Vera Bernhard, Sandra Schwab, J. Goldman
{"title":"Acoustic Stress Detection in Isolated English Words for Computer-Assisted Pronunciation Training","authors":"Vera Bernhard, Sandra Schwab, J. Goldman","doi":"10.21437/interspeech.2022-197","DOIUrl":"https://doi.org/10.21437/interspeech.2022-197","url":null,"abstract":"We propose a system for automatic lexical stress detection in isolated English words. It is designed to be part of the computer-assisted pronunciation training application MIAPARLE (“https://miaparle.unige.ch”) that specifically focuses on stress contrasts acquisition. Training lexical stress cannot be disregarded in language education as the accuracy in production highly affects the intelligibility and perceived fluency of an L2 speaker. The pipeline automatically segments audio input into syllables over which duration, intensity, pitch, and spectral information is calculated. Since the stress of a syllable is defined relative to its neighboring syllables, the values obtained over the syllables are complemented with differential values to the preceding and following syllables. The resulting feature vectors, retrieved from 1011 recordings of single words spoken by English natives, are used to train a Voting Classifier composed of four supervised classifiers, namely a Support Vector Machine, a Neural Net, a K Nearest Neighbor, and a Random Forest classifier. The approach determines syllables of a single word as stressed or unstressed with an F1 score of 94% and an accuracy of 96%.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"3143-3147"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49006649","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
InterspeechPub Date : 2022-09-18DOI: 10.21437/interspeech.2022-10315
Xiaofeng Shu, Yanjie Chen, Chuxiang Shang, Yan Zhao, Chengshuai Zhao, Yehang Zhu, Chuanzeng Huang, Yuxuan Wang
{"title":"Non-intrusive Speech Quality Assessment with a Multi-Task Learning based Subband Adaptive Attention Temporal Convolutional Neural Network","authors":"Xiaofeng Shu, Yanjie Chen, Chuxiang Shang, Yan Zhao, Chengshuai Zhao, Yehang Zhu, Chuanzeng Huang, Yuxuan Wang","doi":"10.21437/interspeech.2022-10315","DOIUrl":"https://doi.org/10.21437/interspeech.2022-10315","url":null,"abstract":"In terms of subjective evaluations, speech quality has been gen-erally described by a mean opinion score (MOS). In recent years, non-intrusive speech quality assessment shows an active progress by leveraging deep learning techniques. In this paper, we propose a new multi-task learning based model, termed as subband adaptive attention temporal convolutional neural network (SAA-TCN), to perform non-intrusive speech quality assessment with the help of MOS value interval detector (VID) auxiliary task. Instead of using fullband magnitude spectrogram, the proposed model takes subband magnitude spectrogram as the input to reduce model parameters and prevent overfitting. To effectively utilize the energy distribution information along the subband frequency dimension, subband adaptive attention (SAA) is employed to enhance the TCN model. Experimental results reveal that the proposed method achieves a superior performance on predicting the MOS values. In ConferencingSpeech 2022 Challenge, our method achieves a mean Pearson’s correlation coefficient (PCC) score of 0.763 and outperforms the challenge baseline method by 0.233.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"3298-3302"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49153770","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
InterspeechPub Date : 2022-09-18DOI: 10.21437/interspeech.2022-10271
Z. Mostaani, M. Magimai.-Doss
{"title":"On Breathing Pattern Information in Synthetic Speech","authors":"Z. Mostaani, M. Magimai.-Doss","doi":"10.21437/interspeech.2022-10271","DOIUrl":"https://doi.org/10.21437/interspeech.2022-10271","url":null,"abstract":"The respiratory system is an integral part of human speech production. As a consequence, there is a close relation between respiration and speech signal, and the produced speech signal carries breathing pattern related information. Speech can also be generated using speech synthesis systems. In this paper, we investigate whether synthetic speech carries breathing pattern related information in the same way as natural human speech. We address this research question in the framework of logical-access presentation attack detection using embeddings extracted from neural networks pre-trained for speech breathing pattern estimation. Our studies on ASVSpoof 2019 challenge data show that there is a clear distinction between the extracted breathing pattern embedding of natural human speech and syn-thesized speech, indicating that speech synthesis systems tend to not carry breathing pattern related information in the same way as human speech. Whilst, this is not the case with voice conversion of natural human speech.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"2768-2772"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48554971","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
InterspeechPub Date : 2022-09-18DOI: 10.21437/interspeech.2022-11378
Yi Zhu, Zexun Wang, Hang Liu, Pei-Hsin Wang, Mingchao Feng, Meng Chen, Xiaodong He
{"title":"Cross-modal Transfer Learning via Multi-grained Alignment for End-to-End Spoken Language Understanding","authors":"Yi Zhu, Zexun Wang, Hang Liu, Pei-Hsin Wang, Mingchao Feng, Meng Chen, Xiaodong He","doi":"10.21437/interspeech.2022-11378","DOIUrl":"https://doi.org/10.21437/interspeech.2022-11378","url":null,"abstract":"End-to-end spoken language understanding (E2E-SLU) has witnessed impressive improvements through cross-modal (text-to-audio) transfer learning. However, current methods mostly focus on coarse-grained sequence-level text-to-audio knowledge transfer with simple loss, and neglecting the fine-grained temporal alignment between the two modalities. In this work, we propose a novel multi-grained cross-modal transfer learning framework for E2E-SLU. Specifically, we devise a cross attention module to align the tokens of text with the frame features of speech, encouraging the model to target at the salient acoustic features attended to each token during transferring the semantic information. We also leverage contrastive learning to facilitate cross-modal representation learning in sentence level. Finally, we explore various data augmentation methods to mitigate the deficiency of large amount of labelled data for the training of E2E-SLU. Extensive experiments are conducted on both English and Chinese SLU datasets to verify the effectiveness of our proposed approach. Experimental results and detailed analyses demonstrate the superiority and competitiveness of our model.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"1131-1135"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44733411","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
InterspeechPub Date : 2022-09-18DOI: 10.21437/interspeech.2022-10986
Maria del Mar Cordero, Ambre Denis-Noël, E. Spinelli, F. Meunier
{"title":"Neural correlates of acoustic and semantic cues during speech segmentation in French","authors":"Maria del Mar Cordero, Ambre Denis-Noël, E. Spinelli, F. Meunier","doi":"10.21437/interspeech.2022-10986","DOIUrl":"https://doi.org/10.21437/interspeech.2022-10986","url":null,"abstract":"Natural speech is highly complex and variable. Particularly, spoken language, in contrast to written language, has no clear word boundaries. Adult listeners can exploit different types of information to segment the continuous stream such as acoustic and semantic information. However, the weight of these cues, when co-occurring, remains to be determined. Behavioural tasks are not conclusive on this point as they focus participants ’ attention on certain sources of information, thus biasing the results. Here, we looked at the processing of homophonic utterances such as l’amie vs la mie (both /lami/) which include fine acoustic differences and for which the meaning changes depending on segmentation. To examine the perceptual resolution of such ambiguities when semantic information is available, we measured the online processing of sentences containing such sequences in an ERP experiment involving no active task. In a congruent context, semantic information matched the acoustic signal of the word amie, while, in the incongruent condition, the semantic information carried by the sentence and the acoustic signal were leading to different lexical candidates. No clear neural markers for the use of acoustic cues were found. Our results suggest a preponderant weight of semantic information over acoustic information during natural spoken sentence processing.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"4058-4062"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41513074","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}