Interspeech. Pub Date: 2022-09-18. DOI: 10.21437/interspeech.2022-10077. Pages: 2513-2517.
Title: Speaker- and Phone-aware Convolutional Transformer Network for Acoustic Echo Cancellation
Authors: Chang Han, Weiping Tu, Yuhong Yang, Jingyi Li, Xinhong Li
Abstract: Recent studies indicate the effectiveness of deep learning (DL) based methods for acoustic echo cancellation (AEC) in background noise and nonlinear distortion scenarios. However, content and speaker variations degrade the performance of such DL-based AEC models. In this study, we propose an AEC model that takes phonetic and speaker identity features as auxiliary inputs and present a complex dual-path convolutional transformer network (DPCTNet). Given an input signal, the phonetic and speaker identity features extracted by a contrastive predictive coding network (a self-supervised pre-trained model) and the complex spectrum generated by the short-time Fourier transform are treated as the spectrum pattern inputs for DPCTNet. In addition, DPCTNet applies an encoder-decoder architecture improved by inserting a dual-path transformer to effectively model the extracted inputs within a single frame and the dependence between consecutive frames. Comparative experimental results showed that AEC performance can be improved by explicitly considering phonetic and speaker identity features.

Interspeech. Pub Date: 2022-09-18. DOI: 10.21437/interspeech.2022-10513. Pages: 5253-5257.
Title: Convolutional Neural Networks for Classification of Voice Qualities from Speech and Neck Surface Accelerometer Signals
Authors: Sudarsana Reddy Kadiri, F. Javanmardi, P. Alku
Abstract: Prior studies in the automatic classification of voice quality have mainly studied support vector machine (SVM) classifiers using the acoustic speech signal as input. Recently, one voice quality classification study was published using neck surface accelerometer (NSA) and speech signals as inputs and using SVMs with hand-crafted glottal source features. The present study examines simultaneously recorded NSA and speech signals in the classification of three voice qualities (breathy, modal, and pressed) using convolutional neural networks (CNNs) as classifiers. The study has two goals: (1) to investigate which of the two signals (NSA vs. speech) is more useful in the classification task, and (2) to compare whether deep learning-based CNN classifiers with spectrogram and mel-spectrogram features are able to improve the classification accuracy compared to SVM classifiers using hand-crafted glottal source features. The results indicated that the NSA signal yielded better classification of the voice qualities than the speech signal, and that the CNN classifier outperformed the SVM classifiers by large margins. The best mean classification accuracy was achieved with the mel-spectrogram as input to the CNN classifier (93.8% for NSA and 90.6% for speech).

Interspeech. Pub Date: 2022-09-18. DOI: 10.21437/interspeech.2022-888. Pages: 744-748.
Title: Recurrent multi-head attention fusion network for combining audio and text for speech emotion recognition
Authors: C. Ahn, Chamara Kasun, S. Sivadas, Jagath Rajapakse

Interspeech. Pub Date: 2022-09-18. DOI: 10.21437/interspeech.2022-11388. Pages: 1136-1140.
Title: Use of Nods Less Synchronized with Turn-Taking and Prosody During Conversations in Adults with Autism
Authors: K. Ochi, Nobutaka Ono, Keiho Owada, Kuroda Miho, S. Sagayama, H. Yamasue
Abstract: Autism spectrum disorder (ASD) is a highly prevalent neurodevelopmental disorder characterized by deficits in communication and social interaction. Head-nodding, a form of visual backchannel, is used to co-construct conversation and is crucial to smooth social interaction. In the present study, we quantitatively analyze how head-nodding relates to speech turn-taking and prosodic change in Japanese conversation. The results showed that nodding was observed less frequently in ASD participants, especially around speakers' turn transitions, whereas it was notable just before and after turn-taking in individuals with typical development (TD). Analysis using long 16-second sliding segments revealed that synchronization between nod frequency and mean vocal intensity was higher in the TD group than in the ASD group. Classification by a support vector machine (SVM) using these proposed features achieved high performance, with an accuracy of 91.1% and an F-measure of 0.942. In addition, the results indicated an optimal way of nodding according to turn-ending and emphasis, which could provide standard responses for reference or feedback in social skill training for people with ASD. Furthermore, the natural timing of nodding implied by the results can also be applied to developing interactive responses in humanoid robots or computer graphics (CG) agents.

Interspeech. Pub Date: 2022-09-18. DOI: 10.21437/interspeech.2022-11077. Pages: 3824-3828.
Title: Weakly-Supervised Neural Full-Rank Spatial Covariance Analysis for a Front-End System of Distant Speech Recognition
Authors: Yoshiaki Bando, T. Aizawa, Katsutoshi Itoyama, K. Nakadai
Abstract: This paper presents a weakly-supervised multichannel neural speech separation method for distant speech recognition (DSR) of real conversational speech mixtures. A blind source separation (BSS) method called neural full-rank spatial covariance analysis (FCA) can precisely separate multichannel speech mixtures by using a deep spectral model without any supervision. Neural FCA, however, requires that the number of sound sources is fixed and known in advance. This requirement complicates its use in a front-end system of DSR for multispeaker conversations, in which the number of speakers changes dynamically. In this paper, we propose an extension of neural FCA that handles a dynamically changing number of sound sources by taking the temporal voice activities of target speakers as auxiliary information. We train a source separation network in a weakly-supervised manner using a dataset of multichannel audio mixtures and their voice activities. Experimental results on the CHiME-6 dataset, whose task is to recognize conversations at dinner parties, show that our method outperformed a conventional BSS-based system in word error rate.

Interspeech. Pub Date: 2022-09-18. DOI: 10.21437/interspeech.2022-12. Pages: 5223-5227.
Title: Native phonotactic interference in L2 vowel processing: Mouse-tracking reveals cognitive conflicts during identification
Authors: Yizhou Wang, R. Bundgaard-Nielsen, B. Baker, Olga Maxwell
Abstract: Regularities of phoneme distribution in a listener's native language (L1), i.e., L1 phonotactics, can at times induce interference in their perception of second language (L2) phonemes and phonemic strings. This paper presents a study examining phonological interference experienced by L1 Mandarin listeners in identifying the English /i/ vowel in three consonantal contexts /p, f, w/, which have different distributional patterns in Mandarin phonology: /pi/ is a licit sequence in Mandarin, */fi/ is illicit due to co-occurrence restrictions, and */wi/ is illicit due to Mandarin contextual allophony. L1 Mandarin listeners completed two versions of an identification experiment (keystroke and mouse-tracking), in which they identified vowels in different consonantal contexts. Analysis of error rates, response times, and hand motions in the tasks suggests that L1 co-occurrence restrictions and contextual allophony induce different levels of phonological interference in L2 vowel perception compared to the licit control condition. In support of the dynamic theory of linguistic cognition, our results indicate that illicit phonotactic contexts can lead to more identification errors, longer decision processes, and spurious activation of a distractor category.

{"title":"Predicting VQVAE-based Character Acting Style from Quotation-Annotated Text for Audiobook Speech Synthesis","authors":"Wataru Nakata, Tomoki Koriyama, Shinnosuke Takamichi, Yuki Saito, Yusuke Ijima, Ryo Masumura, H. Saruwatari","doi":"10.21437/interspeech.2022-638","DOIUrl":"https://doi.org/10.21437/interspeech.2022-638","url":null,"abstract":"We propose a speech-synthesis model for predicting appropriate voice styles on the basis of the character-annotated text for audiobook speech synthesis. An audiobook is more engaging when the narrator makes distinctive voices depending on the story characters. Our goal is to produce such distinctive voices in the speech-synthesis framework. However, such distinction has not been extensively investigated in audiobook speech synthesis. To enable the speech-synthesis model to achieve distinctive voices depending on characters with minimum extra anno-tation, we propose a speech synthesis model to predict character appropriate voices from quotation-annotated text. Our proposed model involves character-acting-style extraction based on a vector quantized variational autoencoder, and style prediction from quotation-annotated texts which enables us to automate audiobook creation with character-distinctive voices from quotation-annotated texts. To the best of our knowledge, this is the first attempt to model intra-speaker voice style depending on character acting for audiobook speech synthesis. We conducted subjective evaluations of our model, and the results indicate that the proposed model generated more distinctive character voices compared to models that do not use the explicit character-acting-style while maintaining the naturalness of synthetic speech.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"4551-4555"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41835460","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Interspeech. Pub Date: 2022-09-18. DOI: 10.21437/interspeech.2022-10951. Pages: 2268-2272.
Title: Speaker Trait Enhancement for Cochlear Implant Users: A Case Study for Speaker Emotion Perception
Authors: Avamarie Brueggeman, J. Hansen
Abstract: Despite significant progress in areas such as speech recognition, cochlear implant users still experience challenges related to identifying various speaker traits such as gender, age, emotion, and accent. In this study, we focus on emotion as one trait. We propose the use of emotion intensity conversion to perceptually enhance emotional speech with the goal of improving speech emotion recognition for cochlear implant users. To this end, we utilize a parallel speech dataset containing emotion and intensity labels to perform conversion from normal- to high-intensity emotional speech. A non-negative matrix factorization method is integrated to perform emotion intensity conversion via spectral mapping. We evaluate our emotional speech enhancement using a support vector machine model for emotion recognition. In addition, we perform an emotional speech recognition listener experiment with normal-hearing listeners using vocoded audio. It is suggested that such enhancement will benefit speaker trait perception for cochlear implant users.

Interspeech. Pub Date: 2022-09-18. DOI: 10.21437/interspeech.2022-817. Pages: 2283-2287.
Title: Semantically Meaningful Metrics for Norwegian ASR Systems
Authors: J. Rugayan, T. Svendsen, G. Salvi
Abstract: Evaluation metrics are important for quantifying the performance of Automatic Speech Recognition (ASR) systems. However, the widely used word error rate (WER) captures errors at the word level only and weighs each error equally, which makes it insufficient to discern ASR system performance for downstream tasks such as Natural Language Understanding (NLU) or information retrieval. We explore in this paper a more robust and discriminative evaluation metric for Norwegian ASR systems through the use of semantic information modeled by a transformer-based language model. We propose Aligned Semantic Distance (ASD), which employs dynamic programming to quantify the similarity between the reference and hypothesis text. First, embedding vectors are generated using the NorBERT model. Afterwards, the minimum global distance of the optimal alignment between these vectors is obtained and normalized by the length of the reference embedding sequence. In addition, we present results using Semantic Distance (SemDist) and compare them with ASD. Results show that for the same WER, ASD and SemDist values can vary significantly, exemplifying that not all recognition errors can be considered equally important. We investigate the resulting data and present examples which demonstrate the nuances of both metrics in evaluating various transcription errors.

Interspeech. Pub Date: 2022-09-18. DOI: 10.21437/interspeech.2022-10309. Pages: 4342-4346.
Title: An Alignment Method Leveraging Articulatory Features for Mispronunciation Detection and Diagnosis in L2 English
Authors: Qi Chen, Binghuai Lin, Yanlu Xie
Abstract: Mispronunciation Detection and Diagnosis (MD&D) technology is used for detecting mispronunciations and providing feedback. Most MD&D systems are based on phoneme recognition. However, few studies have made use of the predefined reference text that is provided to second language (L2) learners while practicing pronunciation. In this paper, we propose a novel alignment method based on linguistic knowledge of articulatory manner and place to align the phone sequences of the reference text with L2 learners' speech. After obtaining the alignment results, we concatenate the corresponding phoneme embedding with the acoustic features of each speech frame as input. This method makes reasonable use of the reference text information as extra input. Experimental results show that the model can implicitly learn valid information in the reference text by this method, while avoiding the introduction of misleading information from the reference text, which would cause false acceptances (FA). In addition, the method incorporates articulatory features, which helps the model recognize phonemes. We evaluate the method on the L2-ARCTIC dataset, and our approach improves the F1-score over the state-of-the-art system by 4.9% relative.
