{"title":"Effects of voice onset time and place of articulation on perception of dichotic Turkish syllables","authors":"Emre Eskicioglu , Serhat Taslica , Cagdas Guducu , Adile Oniz , Murat Ozgoren","doi":"10.1016/j.specom.2024.103170","DOIUrl":"10.1016/j.specom.2024.103170","url":null,"abstract":"<div><div>Dichotic listening has been widely used in research investigating the hemispheric specialization of language. A common finding is the Right-ear Advantage (REA), reflecting left hemisphere speech sound perception specialization. However, acoustic/phonetic features of the stimuli, such as voice onset time (VOT) and place of articulation (POA), are known to affect the REA. This study investigates the effects of these features on the REA in the Turkish language, whose language family differs from the languages typically used in previous VOT and POA studies. Data of 95 right-handed participants with REA, which was defined as reporting at least one more correct right than left ear response, were analyzed. Prevoiced consonants were dominant compared with consonants with long VOT and resulted in increased REA. Velar consonants were dominant compared with other consonants. Velar and alveolar consonants resulted in higher REA than bilabial consonants. Lateralization and error rates were lower when POA, but not VOT, of the consonants differed. Error responses were mostly determined by the VOT feature of the consonant presented to the right ear. To conclude, the effects of VOT and PoA on the hemispheric asymmetry in Turkish have been spotted by a behavioral approach. Further neuroimaging or electrophysiologic investigations are needed to validate and shed light into the underlying mechanisms of VOT and PoA effects during the DL test.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"167 ","pages":"Article 103170"},"PeriodicalIF":2.4,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143128502","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Spoken language identification: An overview of past and present research trends","authors":"Douglas O'Shaughnessy","doi":"10.1016/j.specom.2024.103167","DOIUrl":"10.1016/j.specom.2024.103167","url":null,"abstract":"<div><div>Identification of the language used in spoken utterances is useful for multiple applications, e.g., assist in directing or automating telephone calls, or selecting which language-specific speech recognizer to use. This paper reviews modern methods of automatic language identification. It examines what information in speech helps to distinguish among languages, and extends these ideas to dialect estimation as well. As approaches to recognize languages often share much in common with both automatic speech recognition and speaker verification, these three processes are compared. Many methods are drawn from pattern recognition research in other areas, such as image and text recognition. This paper notes how speech is different from most other signals to recognize, and how language identification differs from other speech applications. While it is mainly addressed to readers who are not experts in speech processing (as detailed algorithms, readily found in the cited literature, are omitted here), the presentation covers a wide discussion useful to experts too.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"167 ","pages":"Article 103167"},"PeriodicalIF":2.4,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143128499","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Systematic review: The identification of segmental Mandarin-accented English features","authors":"Hongzhi Wang, Rachael-Anne Knight, Lucy Dipper, Roy Alderton, Reem S․ W․ Alyahya","doi":"10.1016/j.specom.2024.103168","DOIUrl":"10.1016/j.specom.2024.103168","url":null,"abstract":"<div><h3>Background</h3><div>The pronunciation of L2 English by L1 Mandarin speakers is influenced by transfer effects from the phonology of Mandarin. However, there is a research gap in systematically synthesizing and reviewing segmental Mandarin-accented English features (SMAEFs) from the existing literature. An accurate and comprehensive description of SMAEFs is necessary for applied science in relevant fields.</div></div><div><h3>Aim</h3><div>To identify the segmental features that are most consistently described as characteristic of Mandarin-accented English in previous literature.</div></div><div><h3>Methods</h3><div>A systematic review was conducted. The studies were identified through searching in nine databases with eight screening criteria.</div></div><div><h3>Results</h3><div>The systematic review includes nineteen studies with a total of 1,873 Mandarin English speakers. The included studies yield 45 SMAEFs, classified into Vowel and Consonant categories, under which there are multiple sub-categories. The results are supported by evidence of different levels of strength. The four frequently reported findings, which are 1) variations in vowel height and frontness, 2) schwa epenthesis, 3) variations in closure duration in plosives and 4) illegal consonant deletion, were identified and analyzed in terms of their potential intelligibility outcomes.</div></div><div><h3>Conclusion</h3><div>The number of SMAEFs is large. These features occur in numerous traditional phonetic categories and two categories (i.e. schwa epenthesis and illegal consonant deletion) that are typically used to describe features in connected speech. The study outcomes may provide valuable insights for researchers and practitioners in the fields of English Language Teaching, phonetics, and speech recognition system development in terms of selecting the pronunciation features to focus on in teaching and research or supporting the successful identification of accented features.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"167 ","pages":"Article 103168"},"PeriodicalIF":2.4,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143101902","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Self-distillation-based domain exploration for source speaker verification under spoofed speech from unknown voice conversion","authors":"Xinlei Ma , Ruiteng Zhang , Jianguo Wei , Xugang Lu , Junhai Xu , Lin Zhang , Wenhuan Lu","doi":"10.1016/j.specom.2024.103153","DOIUrl":"10.1016/j.specom.2024.103153","url":null,"abstract":"<div><div>Advancements in voice conversion (VC) technology have made it easier to generate spoofed speech that closely resembles the identity of a target speaker. Meanwhile, verification systems within the realm of speech processing are widely used to identify speakers. However, the misuse of VC algorithms poses significant privacy and security risks by potentially deceiving these systems. To address this issue, source speaker verification (SSV) has been proposed to verify the source speaker’s identity of the spoofed speech generated by VCs. Nevertheless, SSV often suffers severe performance degradation when confronted with unknown VC algorithms, which is usually neglected by researchers. To deal with this cross-voice-conversion scenario and enhance the model’s performance when facing unknown VC methods, we redefine it as a novel domain adaptation task by treating each VC method as a distinct domain. In this context, we propose an unsupervised domain adaptation (UDA) algorithm termed self-distillation-based domain exploration (SDDE). This algorithm adopts a siamese framework with two branches: one trained on the source (known) domain and the other trained on the target domains (unknown VC methods). The branch trained on the source domain leverages supervised learning to capture the source speaker’s intrinsic features. Meanwhile, the branch trained on the target domain employs self-distillation to explore target domain information from multi-scale segments. Additionally, we have constructed a large-scale data set comprising over 7945 h of spoofed speech to evaluate the proposed SDDE. Experimental results on this data set demonstrate that SDDE outperforms traditional UDAs and substantially enhances the performance of the SSV model under unknown VC scenarios. The code for data generation and the trial lists are available at <span><span>https://github.com/zrtlemontree/cross-domain-source-speaker-verification</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"167 ","pages":"Article 103153"},"PeriodicalIF":2.4,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143128492","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improved AED with multi-stage feature extraction and fusion based on RFAConv and PSA","authors":"Bingbing Wang, Yangjie Wei, Zhuangzhuang Wang, Zekang Qi","doi":"10.1016/j.specom.2024.103166","DOIUrl":"10.1016/j.specom.2024.103166","url":null,"abstract":"<div><div>End-to-end speech recognition systems based on the Attention-based Encoder-Decoder (AED) model normally achieve high accuracy because they concurrently consider the previously generated tokens and contextual features of speech signals. However, the spatial, positional, and multiscale information during shallow feature extraction is mostly neglected, and the shallow and deep features are rarely effectively fused. These problems seriously limit the accuracy and speed of speech recognition in real applications. This study proposes a multi-stage feature extraction and fusion method tailored for end-to-end speech recognition systems based on the AED model. Initially, the receptive-field attention convolutional module is introduced into the front-end feature extraction stage of AED. This module employs a receptive field attention mechanism to enhance the model's feature extraction capability by focusing on the positional and spatial information of speech signals. Moreover, a pyramid squeeze attention mechanism is incorporated into the encoder module to effectively merge the shallow and deep features, and feature maps are recalibrated through weight learning to enhance the accuracy of the encoder's output features. Finally, the effectiveness and robustness of our method are validated across various end-to-end speech recognition models. The experimental results prove that our improved AED speech recognition models with multi-stage feature extraction and fusion achieve a lower word error rate without a language model, and their transcriptions are more accurate and grammatically precise.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"167 ","pages":"Article 103166"},"PeriodicalIF":2.4,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143128501","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"One-class network leveraging spectro-temporal features for generalized synthetic speech detection","authors":"Jiahong Ye , Diqun Yan , Songyin Fu , Bin Ma , Zhihua Xia","doi":"10.1016/j.specom.2025.103200","DOIUrl":"10.1016/j.specom.2025.103200","url":null,"abstract":"<div><div>Synthetic speech attacks pose significant threats to Automatic Speaker Verification (ASV) systems. To counter these, various detection systems have been developed. However, these models often struggle with reduced accuracy when encountering novel spoofing attacks during testing. To address this issue, this paper proposes a One-Class Network architecture that leverages features extracted from the log power spectrum of the F0 subband. We have developed an advanced spectro-temporal enhancement module, comprising the Temporal Correlation Integrate Module (TCIM) and the Frequency-Adaptive Dependency Module (FADM), to accurately capture F0 subband details. TCIM captures crucial temporal dynamics and models the long-term dependencies characteristic of the F0 signals within the F0 subband. Meanwhile, FADM employs a frequency-adaptive mechanism to identify critical frequency bands, allowing the detection system to conduct a thorough and detailed analysis. Additionally, we introduce a KLOC-Softmax loss function that incorporates the KoLeo regularizer. This function promotes a uniform distribution of features within batches, effectively addressing intra-class imbalance and aiding balanced optimization. Experimental results on the ASVspoof 2019 LA dataset show that our approach achieves an equal error rate (EER) of 0.38% and a minimum tandem detection cost function (min t-DCF) of 0.0127. Our method outperforms most state-of-the-art speech anti-spoofing techniques and demonstrates robust generalizability to previously unseen types of synthetic speech attacks.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"169 ","pages":"Article 103200"},"PeriodicalIF":2.4,"publicationDate":"2025-01-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143103333","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Effects of harmonicity on Mandarin speech perception in cochlear implant users","authors":"Mingyue Shi , Qinglin Meng , Huali Zhou , Jiawen Li , Yefei Mo , Nengheng Zheng","doi":"10.1016/j.specom.2025.103199","DOIUrl":"10.1016/j.specom.2025.103199","url":null,"abstract":"<div><div>Previous research has demonstrated the negligible impact of harmonicity on English speech perception for normal hearing (NH) listeners in quiet environments. This study aims to bridge the gap in understanding the role of harmonicity in Mandarin speech perception for cochlear implant (CI) users. Speech perception in quiet was tested in both CI simulation group and actual CI user group using harmonic and inharmonic Mandarin speech. Furthermore, speech-on-speech perception was tested in NH, CI simulation, and actual CI user groups. For speech perception in quiet, results show that, compared to harmonic speech, inharmonic speech decreased the mean recognition rate for both actual CI user and CI simulation groups by about 10 percentage points. For speech-on-speech perception, all groups (i.e., NH, CI simulation, and actual CI user) performed worse with inharmonic stimuli compared to harmonic stimuli. The findings of this study, along with previous studies in NH listeners, indicate that harmonicity aids target speech recognition for NH listeners in speech-on-speech conditions but not speech perception in quiet. In contrast, harmonicity plays an important role in CI users’ Mandarin speech recognition in both quiet and speech-on-speech conditions. However, under speech-on-speech conditions, CI users could only understand target speech at positive SNRs (often <span><math><mo>></mo></math></span> 5 dB), suggesting that their performance depends on the intelligibility of the target speech. The contribution of harmonicity to masking release in CI users remains unclear.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"169 ","pages":"Article 103199"},"PeriodicalIF":2.4,"publicationDate":"2025-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143103334","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Coordination Attention based Transformers with bidirectional contrastive loss for multimodal speech emotion recognition","authors":"Weiquan Fan , Xiangmin Xu , Guohua Zhou , Xiaofang Deng , Xiaofen Xing","doi":"10.1016/j.specom.2025.103198","DOIUrl":"10.1016/j.specom.2025.103198","url":null,"abstract":"<div><div>Emotion recognition is crucial to improve the human–computer interaction experience. Attention mechanisms have become a mainstream technique due to their excellent ability to capture emotion representations. Existing algorithms often employ self-attention and cross-attention for multimodal interactions, which artificially set specific attention patterns at specific layers of the model. However, it is uncertain which attention mechanism is more important in different layers of the model. In this paper, we propose a Coordination Attention based Transformers (CAT). Based on the dual attention paradigm, CAT dynamically infers the pass rates of self-attention and cross-attention layer by layer, coordinating the importance of intra-modal and inter-modal factors. Further, we propose a bidirectional contrastive loss to cluster the matching pairs between modalities and push the mismatching pairs farther apart. Experiments demonstrate the effectiveness of our method, and the state-of-the-art performance is achieved under the same experimental conditions.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"169 ","pages":"Article 103198"},"PeriodicalIF":2.4,"publicationDate":"2025-01-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143164900","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Spatio-temporal masked autoencoder-based phonetic segments classification from ultrasound","authors":"Xi Dan , Kele Xu , Yihang Zhou , Chuanguang Yang , Yihao Chen , Yutao Dou , Cheng Yang","doi":"10.1016/j.specom.2025.103186","DOIUrl":"10.1016/j.specom.2025.103186","url":null,"abstract":"<div><div>The integration of Ultrasound Tongue Imaging (UTI) into clinical linguistics and phonetics research facilitates the examination of articulatory patterns and the correlation between speech sounds and their physical manifestations. This proves highly advantage for diagnosing speech disorders and improving the study for speech production and silent speech recognition. In recent years, self-supervised learning (SSL) has gathered attention as a cost-effective approach for analyzing UTI data. However, it is noteworthy that most existing SSL models often do not fully exploit the contextual information embedded within UTI sequences. To tackle this challenge, we present a novel SSL framework for UTI classification that capitalizes on both the pre-training and fine-tuning phases. Specifically, we propose spatio-temporal masking to harness contextual information during pre-training, thus reducing the need for human annotation. Besides, we insert token shift module into the encoder to enhance the model representation of the spatio-temporal features of tongue movements in UTI sequences. Additionally, to imitate the decision path of the domain experts, we apply hard example mining techniques during fine-tuning to augment the performance of the model. The experimental results on a publicly available dataset demonstrate that our proposed method outperforms other competitive methods in UTI classification tasks, which underscores the potential of our approach to enhance the analysis and interpretation of UTI data. Our code is available at <span><span>https://github.com/colaudiolab/USenhance.git</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"169 ","pages":"Article 103186"},"PeriodicalIF":2.4,"publicationDate":"2025-01-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143103338","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Role of language familiarity in understanding speech in noise under various acoustic environments","authors":"C.T. Justine Hui , Hinako Masuda , Eri Osawa , Takayuki Arai , Catherine I. Watson , Yusuke Hioka","doi":"10.1016/j.specom.2025.103195","DOIUrl":"10.1016/j.specom.2025.103195","url":null,"abstract":"<div><div>We communicate in complex acoustic environments in everyday life but our familiarity with the language can affect how well we can understand speech in these environments. The current study examines the role of language familiarity in understanding speech in varying acoustic environments via a speech intelligibility test conducted under anechoic and reverberant conditions with various speech-noise separation angles. Four groups were recruited with differing level of language familiarity: first language (L1) New Zealand English (NZE) listeners, second language (L2) Japanese native listeners with exposure to NZE, L2 Japanese native listeners with overseas English experiences without exposure to NZE, and Japanese native listeners who have learnt English as a foreign language (FL) without overseas English experiences.</div><div>The L1 group performed better in overall speech intelligibility performance compared to the 3 Japanese native groups. Contrary to previous literature where non-native listeners were found to have a similar benefit from spatial separation to native listeners, this was not the case for the FL group, suggesting that this benefit is only available for listeners with a certain level of language familiarity. While there were differences between L2 and FL groups in the anechoic condition, these differences become marginal in the reverberant conditions for the two groups with little exposure to NZE. This suggests that familiarity to the specific language variety has an advantage in acoustically adverse environments.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"169 ","pages":"Article 103195"},"PeriodicalIF":2.4,"publicationDate":"2025-01-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143164899","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}