{"title":"The impact of non-native English speakers’ phonological and prosodic features on automatic speech recognition accuracy","authors":"Ingy Farouk Emara , Nabil Hamdy Shaker","doi":"10.1016/j.specom.2024.103038","DOIUrl":"10.1016/j.specom.2024.103038","url":null,"abstract":"<div><p>The present study examines the impact of Arab speakers’ phonological and prosodic features on the accuracy of automatic speech recognition (ASR) of non-native English speech. The authors first investigated the perceptions of 30 Egyptian ESL teachers and 70 Egyptian university students towards the L1 (Arabic)-based errors affecting intelligibility and then carried out a data analysis of the ASR of the students’ English speech to find out whether the errors investigated resulted in intelligibility breakdowns in an ASR setting. In terms of the phonological features of non-native speech, the results showed that the teachers gave more weight to pronunciation features of accented speech that did not actually hinder recognition, that the students were mostly oblivious to the L2 errors they made and their impact on intelligibility, and that L2 errors which were not perceived as serious by both teachers and students had negative impacts on ASR accuracy levels. In regard to the prosodic features of non-native speech, it was found that lower speech rates resulted in more accurate speech recognition levels, higher speech intensity led to less deletion errors, and voice pitch did not seem to have any impact on ASR accuracy levels. The study, accordingly, recommends training ASR systems with more non-native data to increase their accuracy levels as well as paying more attention to remedying non-native speakers’ L1-based errors that are more likely to impact non-native automatic speech recognition.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"157 ","pages":"Article 103038"},"PeriodicalIF":3.2,"publicationDate":"2024-01-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139461501","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Deep temporal clustering features for speech emotion recognition","authors":"Wei-Cheng Lin, Carlos Busso","doi":"10.1016/j.specom.2023.103027","DOIUrl":"10.1016/j.specom.2023.103027","url":null,"abstract":"<div><p>Deep clustering is a popular unsupervised technique for feature representation learning. We recently proposed the chunk-based DeepEmoCluster framework for <em>speech emotion recognition</em> (SER) to adopt the concept of deep clustering as a novel <em>semi-supervised learning</em> (SSL) framework, which achieved improved recognition performances over conventional reconstruction-based approaches. However, the vanilla DeepEmoCluster lacks critical sentence-level temporal information that is useful for SER tasks. This study builds upon the DeepEmoCluster framework, creating a powerful SSL approach that leverages temporal information within a sentence. We propose two sentence-level temporal modeling alternatives using either the <em>temporal-net</em> or the <em>triplet loss</em> function, resulting in a novel temporal-enhanced DeepEmoCluster framework to capture essential temporal information. The key contribution to achieving this goal is the proposed sentence-level uniform sampling strategy, which preserves the original temporal order of the data for the clustering process. An extra network module (e.g., gated recurrent unit) is utilized for the temporal-net option to encode temporal information across the data chunks. Alternatively, we can impose additional temporal constraints by using the triplet loss function while training the DeepEmoCluster framework, which does not increase model complexity. Our experimental results based on the MSP-Podcast corpus demonstrate that the proposed temporal-enhanced framework significantly outperforms the vanilla DeepEmoCluster framework and other existing SSL approaches in regression tasks for the emotional attributes arousal, dominance, and valence. The improvements are observed in fully-supervised learning or SSL implementations. Further analyses validate the effectiveness of the proposed temporal modeling, showing (1) high temporal consistency in the cluster assignment, and (2) well-separated <em>emotional patterns</em> in the generated clusters.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"157 ","pages":"Article 103027"},"PeriodicalIF":3.2,"publicationDate":"2024-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0167639323001619/pdfft?md5=8a58455c8fa8b02caee36f8fcfccf479&pid=1-s2.0-S0167639323001619-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139082603","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"LPIPS-AttnWav2Lip: Generic audio-driven lip synchronization for talking head generation in the wild","authors":"Zhipeng Chen , Xinheng Wang , Lun Xie , Haijie Yuan , Hang Pan","doi":"10.1016/j.specom.2023.103028","DOIUrl":"10.1016/j.specom.2023.103028","url":null,"abstract":"<div><p>Researchers have shown a growing interest in Audio-driven Talking Head Generation. The primary challenge in talking head generation is achieving audio-visual coherence between the lips and the audio, known as lip synchronization. This paper proposes a generic method, LPIPS-AttnWav2Lip, for reconstructing face images of any speaker based on audio. We used the U-Net architecture based on residual CBAM to better encode and fuse audio and visual modal information. Additionally, the semantic alignment module extends the receptive field of the generator network<span> to obtain the spatial and channel information of the visual features efficiently; and match statistical information of visual features with audio latent vector to achieve the adjustment and injection of the audio content information to the visual information. To achieve exact lip synchronization and to generate realistic high-quality images, our approach adopts LPIPS Loss, which simulates human judgment of image quality and reduces instability possibility during the training process. The proposed method achieves outstanding performance in terms of lip synchronization accuracy and visual quality as demonstrated by subjective and objective evaluation results.</span></p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"157 ","pages":"Article 103028"},"PeriodicalIF":3.2,"publicationDate":"2023-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139027298","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Robust voice activity detection using an auditory-inspired masked modulation encoder based convolutional attention network","authors":"Nan Li , Longbiao Wang , Meng Ge , Masashi Unoki , Sheng Li , Jianwu Dang","doi":"10.1016/j.specom.2023.103024","DOIUrl":"10.1016/j.specom.2023.103024","url":null,"abstract":"<div><p><span><span>Deep learning<span> has revolutionized voice activity detection (VAD) by offering promising solutions. However, directly applying traditional features, such as raw waveforms and Mel-frequency </span></span>cepstral coefficients, to deep </span>neural networks<span><span> often leads to degraded VAD performance due to noise interference. In contrast, humans possess the remarkable ability to discern speech in complex and noisy environments, which motivated us to draw inspiration from the human auditory system. We propose a robust VAD algorithm called auditory-inspired masked modulation encoder based convolutional </span>attention network<span> (AMME-CANet) that integrates our AMME with CANet. Firstly, we investigate the design of auditory-inspired modulation features as a deep-learning encoder (AME), effectively simulating the process of sound-signal transmission to inner ear hair cells and subsequent modulation filtering by neural cells. Secondly, building upon the observed masking effects in the human auditory system, we enhance our auditory-inspired modulation encoder by incorporating a masking mechanism resulting in the AMME. The AMME amplifies cleaner speech frequencies while suppressing noise components. Thirdly, inspired by the human auditory mechanism and capitalizing on contextual information, we leverage the attention mechanism for VAD. This methodology uses an attention mechanism to assign higher weights to contextual information containing richer and more informative cues. Through extensive experimentation and evaluation, we demonstrated the superior performance of AMME-CANet in enhancing VAD under challenging noise conditions.</span></span></p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"157 ","pages":"Article 103024"},"PeriodicalIF":3.2,"publicationDate":"2023-12-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138714391","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Performance of single-channel speech enhancement algorithms on Mandarin listeners with different immersion conditions in New Zealand English","authors":"Yunqi C. Zhang , Yusuke Hioka , C.T. Justine Hui , Catherine I. Watson","doi":"10.1016/j.specom.2023.103026","DOIUrl":"10.1016/j.specom.2023.103026","url":null,"abstract":"<div><p>Speech enhancement (SE) is a widely used technology to improve the quality and intelligibility of noisy speech. So far, SE algorithms were designed and evaluated on native listeners only, but not on non-native listeners who are known to be more disadvantaged when listening in noisy environments. This paper investigates the performance of five widely used single-channel SE algorithms on early-immersed New Zealand English (NZE) listeners and native Mandarin listeners with different immersion conditions in NZE under negative input signal-to-noise ratio (SNR) by conducting a subjective listening test in NZE sentences. The performance of the SE algorithms in terms of speech intelligibility in the three participant groups was investigated. The result showed that the early-immersed group always achieved the highest intelligibility. The late-immersed group outperformed the non-immersed group for higher input SNR conditions, possibly due to the increasing familiarity with the NZE accent, whereas this advantage disappeared at the lowest tested input SNR conditions. The SE algorithms tested in this study failed to improve and rather degraded the speech intelligibility, indicating that these SE algorithms may not be able to reduce the perception gap between early-, late- and non-immersed listeners, nor able to improve the speech intelligibility under negative input SNR in general. These findings have implications for the future development of SE algorithms tailored to Mandarin listeners, and for understanding the impact of language immersion on speech perception in noise.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"157 ","pages":"Article 103026"},"PeriodicalIF":3.2,"publicationDate":"2023-12-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0167639323001607/pdfft?md5=34c5bfa551c84f84c20ac950e89b00d4&pid=1-s2.0-S0167639323001607-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138681095","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Back to grammar: Using grammatical error correction to automatically assess L2 speaking proficiency","authors":"Stefano Bannò , Marco Matassoni","doi":"10.1016/j.specom.2023.103025","DOIUrl":"10.1016/j.specom.2023.103025","url":null,"abstract":"<div><p>In an interconnected world where English has become the lingua franca of culture, entertainment, business, and academia, the growing demand for learning English as a second language (L2) has led to an increasing interest in automatic approaches for assessing spoken language proficiency. In this regard, mastering grammar is one of the key elements of L2 proficiency.</p><p>In this paper, we illustrate an approach to L2 proficiency assessment and feedback based on grammatical features using only publicly available data for training and a small proprietary dataset for testing. Specifically, we implement it in a cascaded fashion, starting from learners’ utterances, investigating disfluency detection, exploring spoken grammatical error correction (GEC), and finally using grammatical features extracted with the spoken GEC module for proficiency assessment.</p><p>We compare this grading system to a BERT-based grader and find that the two systems have similar performances when using manual transcriptions, but their combinations bring significant improvements to the assessment performance and enhance validity and explainability. Instead, when using automatic transcriptions, the GEC-based grader obtains better results than the BERT-based grader.</p><p>The results obtained are discussed and evaluated with appropriate metrics across the proposed pipeline.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"157 ","pages":"Article 103025"},"PeriodicalIF":3.2,"publicationDate":"2023-12-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138580415","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Speakers’ vocal expression of sexual orientation depends on experimenter gender","authors":"Sven Kachel , Adrian P. Simpson , Melanie C. Steffens","doi":"10.1016/j.specom.2023.103023","DOIUrl":"10.1016/j.specom.2023.103023","url":null,"abstract":"<div><p>Since the early days of (phonetic) convergence research, one of the main questions is which individuals are more likely to adapt their speech to others. Especially differences between women and men have been researched with a high intensity. Using a differential approach as well, we complement the existing literature by focusing on another gender-related characteristic, namely sexual orientation. The present study aims to investigate whether and how women differing in sexual orientation vary in their speaking behavior, especially mean fundamental frequency (f0), in the presence of a female vs. male experimenter. Lesbian (<em>n</em> = 19) and straight female speakers (<em>n</em> = 18) engaged in two interactions each: First, they either engaged with a female or male experimenter, and second with the other-gender experimenter (counter-balanced and random assignment to conditions). For each interaction, recordings of read and spontaneous speech were collected. Analyses of read speech demonstrated mirroring of the first experimenter’s mean f0 which persisted even in the presence of the second experimenter. In spontaneous speech, this order effect interacted with exclusiveness of sexual orientation: Mirroring was found for participants who reported being exclusively lesbian/straight, not for those who reported being mainly lesbian/straight. We discuss implications for studies on convergence and research practice in general.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"156 ","pages":"Article 103023"},"PeriodicalIF":3.2,"publicationDate":"2023-12-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0167639323001577/pdfft?md5=e1280cda33f537c756c6ad3e4b34309d&pid=1-s2.0-S0167639323001577-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138542558","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Choosing only the best voice imitators: Top-K many-to-many voice conversion with StarGAN","authors":"Claudio Fernandez-Martín , Adrian Colomer , Claudio Panariello , Valery Naranjo","doi":"10.1016/j.specom.2023.103022","DOIUrl":"https://doi.org/10.1016/j.specom.2023.103022","url":null,"abstract":"<div><p>Voice conversion systems have become increasingly important as the use of voice technology grows. Deep learning techniques, specifically generative adversarial networks (GANs), have enabled significant progress in the creation of synthetic media, including the field of speech synthesis. One of the most recent examples, StarGAN-VC, uses a single pair of generator and discriminator to convert voices between multiple speakers. However, the training stability of GANs can be an issue. The Top-K methodology, which trains the generator using only the best <span><math><mi>K</mi></math></span> generated samples that “fool” the discriminator, has been applied to image tasks and simple GAN architectures. In this work, we demonstrate that the Top-K methodology can improve the quality and stability of converted voices in a state-of-the-art voice conversion system like StarGAN-VC. We also explore the optimal time to implement the Top-K methodology and how to reduce the value of <span><math><mi>K</mi></math></span> during training. Through both quantitative and qualitative studies, it was found that the Top-K methodology leads to quicker convergence and better conversion quality compared to regular or vanilla training. In addition, human listeners perceived the samples generated using Top-K as more natural and were more likely to believe that they were produced by a human speaker. The results of this study demonstrate that the Top-K methodology can effectively improve the performance of deep learning-based voice conversion systems.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"156 ","pages":"Article 103022"},"PeriodicalIF":3.2,"publicationDate":"2023-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0167639323001565/pdfft?md5=74a68a8324a3af4dc4558e4166e99f23&pid=1-s2.0-S0167639323001565-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138474840","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multiscale-multichannel feature extraction and classification through one-dimensional convolutional neural network for Speech emotion recognition","authors":"Minying Liu , Alex Noel Joseph Raj , Vijayarajan Rajangam , Kunwu Ma , Zhemin Zhuang , Shuxin Zhuang","doi":"10.1016/j.specom.2023.103010","DOIUrl":"https://doi.org/10.1016/j.specom.2023.103010","url":null,"abstract":"<div><p>Speech emotion recognition (SER) is a crucial field of research in artificial intelligence and human–computer interaction. Extracting effective speech features for emotion recognition is a continuing research focus in SER. Most research has focused on finding an optimal speech feature to extract hidden local features while ignoring the global relationships of the speech signal. In this paper, we propose a method that utilizes a multiscale-multichannel feature extraction structure with global and local information to obtain comprehensive speech features. Our approach employs a one-dimensional convolutional neural network (1D CNN) for feature learning and emotion recognition, capturing both spectral and spatial characteristics of speech for superior learning capabilities with improved SER results. We conducted extensive experiments on publicly available emotion recognition datasets, employing three distinct data augmentation (DA) techniques to enhance model generalization. Our model utilized Mel-frequency cepstral coefficients and zero-crossing rate features from speech samples for training and outperformed state-of-the-art techniques in terms of accuracy. Additionally, we conducted experiments to validate the effectiveness and reliability of our proposed method.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"156 ","pages":"Article 103010"},"PeriodicalIF":3.2,"publicationDate":"2023-11-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0167639323001449/pdfft?md5=a1739967e793340202a345ded16beeca&pid=1-s2.0-S0167639323001449-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138430705","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Selective transfer subspace learning for small-footprint end-to-end cross-domain keyword spotting","authors":"Fei Ma, Chengliang Wang, Xusheng Li, Zhuo Zeng","doi":"10.1016/j.specom.2023.103019","DOIUrl":"https://doi.org/10.1016/j.specom.2023.103019","url":null,"abstract":"<div><p>In small-footprint end-to-end keyword spotting, it is often expensive and time-consuming to acquire sufficient labels in various speech scenarios. To overcome this problem, transfer learning leverages the rich knowledge of the auxiliary domain to annotate the unlabeled target data. However, most existing transfer learning methods typically learn a domain-invariant feature representation while ignoring the negative transfer problem. In this paper, we propose a new and general cross-domain keyword spotting framework called selective transfer subspace learning (STSL) that avoid negative transfer and dramatically improve the accuracy for cross-domain keyword spotting by actively selecting appropriate source samples. Specifically, STSL first aligns geometrical relationship and weighted distribution discrepancy to learn a domain-invariant projection subspace. Then, it actively selects appropriate source samples that are similar to the target domain for transfer learning to avoid negative transfer. Finally, we formulate a minimization problem that alternately optimizes the projection subspace and source active selection, giving an effective optimization. Experimental results on 10 groups of cross-domain keyword spotting tasks show that our STSL outperforms some state-of-the-art transfer learning methods and no transfer learning methods.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"156 ","pages":"Article 103019"},"PeriodicalIF":3.2,"publicationDate":"2023-11-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S016763932300153X/pdfft?md5=82a39d003305603c8c276ad8a7d9c674&pid=1-s2.0-S016763932300153X-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138448358","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}