Towards robust heart failure detection in digital telephony environments by utilizing transformer-based codec inversion
Saska Tirronen, Farhad Javanmardi, Hilla Pohjalainen, Sudarsana Reddy Kadiri, Kiran Reddy Mittapalle, Pyry Helkkula, Kasimir Kaitue, Mikko Minkkinen, Heli Tolppanen, Tuomo Nieminen, Paavo Alku
Speech Communication, vol. 173, Article 103279. DOI: 10.1016/j.specom.2025.103279. Published online 2025-07-15.
Abstract: This study introduces the Codec Transformer Network (CTN) to enhance the reliability of automatic heart failure (HF) detection from coded telephone speech by addressing codec-related challenges in digital telephony, specifically the codec mismatch between training and inference. CTN is designed to map the mel-spectrogram representations of encoded speech signals back to their original, non-encoded forms, thereby recovering HF-related discriminative information. The effectiveness of CTN is demonstrated in conjunction with three HF detectors based on Support Vector Machine, Random Forest, and K-Nearest Neighbors classifiers. The results show that CTN effectively retrieves the discriminative information between patients and controls and performs comparably to or better than a baseline approach based on multi-condition training.
Multimodal speech emotion recognition via modality constraint with hierarchical bottleneck feature fusion
Ying Wang, Jianjun Lei, Xiangwei Zhu, Tao Zhang
Speech Communication, vol. 173, Article 103278. DOI: 10.1016/j.specom.2025.103278. Published online 2025-07-10.
Abstract: Multimodal models can combine information from different channels simultaneously to improve modeling capability. Many recent studies focus on overcoming the inter-modal conflicts and incomplete intra-modal learning that arise in multimodal architectures. In this paper, we propose a scalable multimodal speech emotion recognition (SER) framework incorporating a hierarchical bottleneck feature (HBF) fusion approach. We further design an intra-modal and inter-modal contrastive learning mechanism that enables self-supervised calibration of both modality-specific and cross-modal feature distributions, achieving adaptive feature fusion and alignment while significantly reducing reliance on rigid feature-alignment constraints. To mitigate conflicts between modalities, we design a modality representation constraint (MRC) method that restricts the learning path of the modality encoders. We also present a modality bargaining (MB) strategy in which the modalities learn through mutual bargaining and balance; by letting the different modalities learn in alternation, it avoids convergence to suboptimal modal representations. These training strategies enable our architecture to perform well on multimodal emotion datasets such as CREMA-D, IEMOCAP, and MELD. Finally, we conduct extensive experiments to demonstrate the effectiveness of the proposed architecture with various modality encoders and different modality combination methods.
{"title":"Non-native (Czech and Russian L1) auditor assessments of some English suprasegmental features: Prominence and pitch accents","authors":"Alexey Tymbay","doi":"10.1016/j.specom.2025.103281","DOIUrl":"10.1016/j.specom.2025.103281","url":null,"abstract":"<div><div>This study reports on a comparative perceptual experiment investigating the ability of Russian and Czech advanced learners of English to identify prominence in spoken English. Two groups of non-native annotators completed prominence marking tasks on English monologues, both before and after undergoing a 12-week phonological training program. The study employed three annotation techniques: Rapid Prosody Transcription (RPT), traditional (British), and ToBI. While the RPT annotations produced by the focus groups did not reach statistical equivalence with those of native English speakers, the data indicate a significant improvement in the perception and categorization of prominence following phonological training. A recurrent difficulty observed in both groups was the accurate identification of prenuclear prominence. This is attributed to prosodic transfer effects from the participants’ first languages, Russian and Czech. The study highlights that systemic, phonetic, and distributional differences in the realization of prominence between L1 and L2 may hinder accurate perceptual judgments in English. It further posits that Russian and Czech speakers rely on different acoustic cues for prominence marking in their native languages, and that these cue-weighting strategies are transferred to English. Nevertheless, the results demonstrate that targeted phonological instruction can substantially enhance L2 learners’ perceptual sensitivity to English prosody.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"173 ","pages":"Article 103281"},"PeriodicalIF":2.4,"publicationDate":"2025-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144678899","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Comparisons of Mandarin on-focus expansion and post-focus compression between native speakers and L2 learners: Production and machine learning classification
Jing Wu, Jun Liu, Ting Wang, Sunghye Cho, Yong-cheol Lee
Speech Communication, vol. 173, Article 103280. DOI: 10.1016/j.specom.2025.103280. Published online 2025-07-09.
Abstract: Korean and Mandarin are reported to have on-focus expansion and post-focus compression in marking prosodic focus. It is not clear whether Korean L2 learners of Mandarin benefit from this prosodic similarity in the production of focused tones or encounter difficulty due to the interaction between tone and intonation in a tonal language. This study examined the prosodic focus of Korean L2 learners of Mandarin through a production experiment, followed by the development of a machine learning classifier to automatically detect learners' production of focused elements. Learners were divided into two groups according to proficiency level (advanced and intermediate) and were directly compared with Mandarin native speakers. Production results showed that intermediate-level speakers did not show any systematic modulations for focus marking. Although the advanced-level speakers performed better than the intermediate group, their prosodic effects of focus were significantly different from those of native speakers in both focus and post-focus positions. The machine learning classification of focused elements reflected clear focus-cueing differences among the three groups: the accuracy rate was approximately 86% for the native speakers, 49% for the advanced learners, and 34% for the intermediate learners. The results suggest that on-focus expansion and post-focus compression are not automatically transferred across languages, even when those languages share similar acoustic correlates of prosodic focus. The study also underscores that the difficulty of acquiring the prosodic structure of a tone language lies mainly in mastering the tones themselves, which hinders learners from non-tonal languages and leads to ineffective realization of on-focus expansion and post-focus compression.
{"title":"Lightweight online punctuation and capitalization restoration for streaming ASR systems","authors":"Martin Polacek, Petr Cerva, Jindrich Zdansky","doi":"10.1016/j.specom.2025.103269","DOIUrl":"10.1016/j.specom.2025.103269","url":null,"abstract":"<div><div>This work proposes a lightweight online approach to automatic punctuation and capitalization restoration (APCR). Our method takes pure text as input and can be utilized in real-time speech transcription systems for, e.g., live captioning of TV or radio streams. We develop and evaluate it in a series of consecutive experiments, starting with the task of automatic punctuation restoration (APR). Within that, we also compare our results to another real-time APR method, which combines textual and acoustic features. The test data that we use for this purpose contains automatic transcripts of radio talks and TV debates. In the second part of the paper, we extend our method towards the task of automatic capitalization restoration (ACR). The resulting approach uses two consecutive ELECTRA-small models complemented by simple classification heads; the first ELECTRA model restores punctuation, while the second performs capitalization. Our complete system allows for restoring question marks, commas, periods, and capitalization with a very short inference time and a low latency of just four words. We evaluate its performance for Czech and German, and also compare its results to those of another existing APCR system for English. We are also publishing the data used for our evaluation and testing.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"173 ","pages":"Article 103269"},"PeriodicalIF":2.4,"publicationDate":"2025-07-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144572378","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploring the nuances of reduction in conversational speech: lexicalized and non-lexicalized reductions","authors":"Kübra Bodur , Corinne Fredouille , Stéphane Rauzy , Christine Meunier","doi":"10.1016/j.specom.2025.103268","DOIUrl":"10.1016/j.specom.2025.103268","url":null,"abstract":"<div><div>In spoken language, a significant proportion of words are produced with missing or underspecified segments, a phenomenon known as reduction. In this study, we distinguish two types of reductions in spontaneous speech: <em>lexicalized</em> reductions, which are well-documented, regularly occurring forms driven primarily by lexical processes, and <em>non-lexicalized</em> reductions, which occur irregularly and lack consistent patterns or representations. The latter are inherently more difficult to detect, and existing methods struggle to capture their full range.</div><div>We introduce a novel bottom-up approach for detecting potential reductions in French conversational speech, complemented by a top-down method focused on detecting previously known reduced forms. Our bottom-up method targets sequences consisting of at least six phonemes produced within a 230 ms window, identifying temporally condensed segments, indicative of reduction.</div><div>Our findings reveal significant variability in reduction patterns across the corpus. Lexicalized reductions displayed relatively stable and consistent ratios, whereas non-lexicalized reductions varied substantially and were strongly influenced by speaker characteristics. Notably, gender had a significant effect on non-lexicalized reductions, with male speakers showing higher reduction ratios, while no such effect was observed for lexicalized reductions. The two reduction types were influenced differently by speaking time and articulation rate. A positive correlation between lexicalized and non-lexicalized reduction ratios suggested speaker-specific tendencies.</div><div>Non-lexicalized reductions showed a higher prevalence of certain phonemes and word categories, whereas lexicalized reductions were more closely linked to morpho-syntactic roles. In a focused investigation of selected lexicalized items, we found that “tu sais” was more frequently reduced when functioning as a discourse marker than when used as a pronoun + verb construction. These results support the interpretation that lexicalized reductions are integrated into the mental lexicon, while non-lexicalized reductions are more context-dependent, further supporting the distinction between the two types of reductions.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"173 ","pages":"Article 103268"},"PeriodicalIF":2.4,"publicationDate":"2025-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144534682","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Prosodic modulation of discourse markers: A cross-linguistic analysis of conversational dynamics","authors":"Yi Shan","doi":"10.1016/j.specom.2025.103271","DOIUrl":"10.1016/j.specom.2025.103271","url":null,"abstract":"<div><div>This paper delves into the fascinating world of prosody and pragmatics in discourse markers (DMs). We have come a long way since the early structural approaches, and now we are exploring dynamic models that reveal how prosody shapes DM interpretation in spoken discourse. Our journey takes us through various research methods, from acoustic analysis to naturalistic observations, each offering unique insights into how intonation, stress, and rhythm interact with DMs to guide conversations. Recent cross-linguistic studies, such as Ahn et al. (2024) on Korean “<em>nay mali</em>” and Wang et al. (2024) on Mandarin “<em>haole</em>,” demonstrate how prosodic detachment and contextual cues facilitate the evolution of DMs from lexical to pragmatic functions, underscoring the interplay between prosody and discourse management. Further cross-linguistic evidence comes from Vercher’s (2023) analysis of Spanish “<em>entonces</em>” and Siebold’s (2021) study on German “<em>dann</em>,” which highlight language-specific prosodic realizations of DMs in turn management and conversational closings. We are also looking at cross-linguistic patterns to uncover both universal trends and language-specific characteristics. It is amazing how cultural context plays such a crucial role in prosodic analysis. Besides, machine learning and AI are revolutionizing the field, allowing us to analyze prosodic features in massive datasets with unprecedented precision. We are now embracing multimodal analysis by combining prosody with non-verbal cues for a more holistic understanding of DMs in face-to-face communication. These findings have real-world applications, from improving speech recognition to enhancing language teaching methods. Looking ahead, we are advocating for an integrated approach that considers the dynamic interplay between prosody, pragmatics, and social context. There is still so much to explore across linguistic boundaries and diverse communicative settings. This review is not just a state-of-the-art overview. Rather, it is a roadmap for future research in this exciting field.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"173 ","pages":"Article 103271"},"PeriodicalIF":2.4,"publicationDate":"2025-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144518752","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Automatic speech recognition technology to evaluate an audiometric word recognition test: A preliminary investigation
Ayden M. Cauchi, Jaina Negandhi, Sharon L. Cushing, Karen A. Gordon
Speech Communication, vol. 173, Article 103270. DOI: 10.1016/j.specom.2025.103270. Published online 2025-06-20.
Abstract: This study investigated the ability of machine learning systems to score a clinical speech perception test in which monosyllabic words are heard and repeated by a listener. The accuracy score is used in audiometric assessments, including cochlear implant candidacy and monitoring. Scoring is performed by clinicians who listen and judge responses, which can create inter-rater variability and takes clinical time. A machine learning approach could support this testing by providing increased reliability and time efficiency, particularly in children. This study focused on the Phonetically Balanced Kindergarten (PBK) word list. Spoken responses (n = 1200) were recorded from 12 adults with normal hearing. These words were presented to 3 automatic speech recognizers (Whisper large, Whisper medium, Ursa) and 7 humans in 7 conditions: unaltered or, to simulate potential speech errors, altered by first or last consonant deletion or low-pass filtering at 1, 2, 4, and 6 kHz (n = 6972 altered responses). Responses were scored as the same or different from the unaltered target. Data revealed that automatic speech recognizers (ASRs) correctly classified unaltered words similarly to human evaluators across conditions [mean ± 1 SE: Whisper large = 88.20% ± 1.52%; Whisper medium = 81.20% ± 1.52%; Ursa = 90.70% ± 1.52%; humans = 91.80% ± 2.16%], [F(3, 3866.2) = 23.63, p < 0.001]. Classifications different from the unaltered target occurred most frequently for the first consonant deletion and 1 kHz filtering conditions. Fleiss kappa metrics showed that ASRs displayed higher agreement than human evaluators across unaltered (ASRs = 0.69; humans = 0.17) and altered (ASRs = 0.56; humans = 0.51) PBK words. These results support the further development of automatic speech recognition systems to support speech perception testing.
{"title":"Speech stimulus continuum synthesis using deep learning methods","authors":"Zhu Li, Yuqing Zhang, Yanlu Xie","doi":"10.1016/j.specom.2025.103266","DOIUrl":"10.1016/j.specom.2025.103266","url":null,"abstract":"<div><div>Creating a naturalistic speech stimulus continuum (i.e., a series of stimuli equally spaced along a specific acoustic dimension between two given categories) is an indispensable component in categorical perception studies. A common method is to manually modify the key acoustic parameter of speech sounds, yet the quality of synthetic speech is still unsatisfying. This work explores how to use deep learning techniques for speech stimulus continuum synthesis, with the aim of improving the naturalness of the synthesized continuum. Drawing on recent advances in speech disentanglement learning, we implement a supervised disentanglement framework based on adversarial training (AT) to separate the specific acoustic feature (e.g., fundamental frequency, formant features) from other contents in speech signals and achieve controllable speech stimulus generation by sampling from the latent space of the key acoustic feature. In addition, drawing on the idea of mutual information (MI) in information theory, we design an unsupervised MI-based disentanglement framework to disentangle the specific acoustic feature from other contents in speech signals. Experiments on stimulus generation of several continua validate the effectiveness of our proposed method in both objective and subjective evaluations.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"173 ","pages":"Article 103266"},"PeriodicalIF":2.4,"publicationDate":"2025-06-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144321733","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The perception of intonational peaks and valleys: The effects of plateaux, declination and experimental task","authors":"Hae-Sung Jeon","doi":"10.1016/j.specom.2025.103267","DOIUrl":"10.1016/j.specom.2025.103267","url":null,"abstract":"<div><div>An experiment assessed listeners’ judgement of either relative pitch height or prominence between two consecutive fundamental frequency (<em>f<sub>o</sub></em>) peaks or valleys in speech. The <em>f<sub>o</sub></em> contour of the first peak or valley was kept constant, while the second was orthogonally manipulated in its height and plateau duration. Half of the stimuli had a flat baseline from which the peaks and valleys were scaled, while the other half had an overtly declining baseline. The results replicated the previous finding that <em>f<sub>o</sub></em> peaks with a long plateau are salient to listeners, while valleys are hard to process even with a plateau. Furthermore, the effect of declination was dependent on the experimental task. Listeners’ responses seemed to be directly affected by the <em>f<sub>o</sub></em> excursion size only for judging relative height between two peaks, while their prominence judgement was strongly affected by the overall impression of the pitch raising or lowering event near the perceptual target. The findings suggest that the global <em>f<sub>o</sub></em> contour, not a single representative <em>f<sub>o</sub></em> value of an intonational event, should be considered in perceptual models of intonation. The findings show an interplay between the signal, listeners’ top-down expectations, and speech perception.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"173 ","pages":"Article 103267"},"PeriodicalIF":2.4,"publicationDate":"2025-06-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144288725","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}