{"title":"Fixed frequency range empirical wavelet transform based acoustic and entropy features for speech emotion recognition","authors":"Siba Prasad Mishra, Pankaj Warule, Suman Deb","doi":"10.1016/j.specom.2024.103148","DOIUrl":"10.1016/j.specom.2024.103148","url":null,"abstract":"<div><div>The primary goal of automated speech emotion recognition (SER) is to accurately and effectively identify each specific emotion conveyed in a speech signal utilizing machines such as computers and mobile devices. The widespread recognition of the popularity of SER among academics for three decades is mainly attributed to its broad application in practical scenarios. The utilization of SER has proven to be beneficial in various fields, such as medical intervention, bolstering safety strategies, conducting vigil functions, enhancing online search engines, enhancing road safety, managing customer relationships, strengthening the connection between machines and humans, and numerous other domains. Many researchers have used diverse methodologies, such as the integration of different attributes, the use of different feature selection techniques, and designed a hybrid or complex model using more than one classifier, to augment the effectiveness of emotion classification. In our study, we used a novel technique called the fixed frequency range empirical wavelet transform (FFREWT) filter bank decomposition method to extract the features, and then used those features to accurately identify each and every emotion in the speech signal. The FFREWT filter bank method segments the speech signal frame (SSF) into many sub-signals or modes. We used each FFREWT-based decomposed mode to get features like the mel frequency cepstral coefficient (MFCC), approximate entropy (ApEn), permutation entropy (PrEn), and increment entropy (IrEn). We then used the different combinations of the proposed FFREWT-based feature sets and the deep neural network (DNN) classifier to classify the speech emotion. Our proposed method helps to achieve an emotion classification accuracy of 89.35%, 84.69%, and 100% using the combinations of the proposed FFREWT-based feature (MFCC + ApEn + PrEn + IrEn) for the EMO-DB, EMOVO, and TESS datasets, respectively. Our experimental results were compared with the other methods, and we found that the proposed FFREWT-based feature combinations with a DNN classifier performed better than the state-of-the-art methods in SER.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"166 ","pages":"Article 103148"},"PeriodicalIF":2.4,"publicationDate":"2024-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142661270","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"AFP-Conformer: Asymptotic feature pyramid conformer for spoofing speech detection","authors":"Yida Huang, Qian Shen, Jianfen Ma","doi":"10.1016/j.specom.2024.103149","DOIUrl":"10.1016/j.specom.2024.103149","url":null,"abstract":"<div><div>The existing spoofing speech detection methods mostly use either convolutional neural networks or Transformer architectures as their backbone, which fail to adequately represent speech features during feature extraction, resulting in poor detection and generalization performance of the models. To solve this limitation, we propose a novel spoofing speech detection method based on the Conformer architecture. This method integrates a convolutional module into the Transformer framework to enhance its capacity for local feature modeling, enabling to extract both local and global information from speech signals simultaneously. Besides, to mitigate the issue of semantic information loss or degradation in traditional feature pyramid networks during feature fusion, we propose a feature fusion method based on the asymptotic feature pyramid network (AFPN) to fuse multi-scale features and improve generalization of detecting unknown attacks. Our experiments conducted on the ASVspoof 2019 LA dataset demonstrate that our proposed method achieved the equal error rate (EER) of 1.61 % and the minimum tandem detection cost function (min t-DCF) of 0.045, effectively improving the detection performance of the model while enhancing its generalization capability against unknown spoofing attacks. In particular, it demonstrates substantial performance improvement in detecting the most challenging A17 attack.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"166 ","pages":"Article 103149"},"PeriodicalIF":2.4,"publicationDate":"2024-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142661268","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A robust temporal map of speech monitoring from planning to articulation","authors":"Lydia Dorokhova , Benjamin Morillon , Cristina Baus , Pascal Belin , Anne-Sophie Dubarry , F.-Xavier Alario , Elin Runnqvist","doi":"10.1016/j.specom.2024.103146","DOIUrl":"10.1016/j.specom.2024.103146","url":null,"abstract":"<div><div>Speakers continuously monitor their own speech to optimize fluent production, but the precise timing and underlying variables influencing speech monitoring remain insufficiently understood. Through two EEG experiments, this study aimed to provide a comprehensive temporal map of monitoring processes ranging from speech planning to articulation.</div><div>In both experiments, participants were primed to switch the consonant onsets of target word pairs read aloud, eliciting speech errors of either lexical or articulatory-phonetic (AP) origin. Experiment I used pairs of the same stimuli words, creating lexical or non-lexical errors when switching initial consonants, with the degree of shared AP features not fully balanced but considered in the analysis. Experiment II followed a similar methodology but used different words in pairs for the lexical and non-lexical conditions, fully orthogonalizing the number of shared AP features.</div><div>As error probability is higher in trials primed to result in lexical versus non-lexical errors and AP-close compared to AP-distant errors, more monitoring is required for these conditions. Similarly, error trials require more monitoring compared to correct trials. We used high versus low error probability on correct trials and errors versus correct trials as indices of monitoring.</div><div>Across both experiments, we observed that on correct trials, lexical error probability effects were present during initial stages of speech planning, while AP error probability effects emerged during speech motor preparation. In contrast, error trials showed differences from correct utterances in both early and late speech motor preparation and during articulation. These findings suggest that (a) response conflict on ultimately correct trials does not persist during articulation; (b) the timecourse of response conflict is restricted to the time window during which a given linguistic level is task-relevant (early on for response appropriateness-related variables and later for articulation-relevant variables); and (c) monitoring during the response is primarily triggered by pre-response monitoring failure. These results support that monitoring in language production is temporally distributed and rely on multiple mechanisms.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"165 ","pages":"Article 103146"},"PeriodicalIF":2.4,"publicationDate":"2024-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142586499","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The combined effects of bilingualism and musicianship on listeners’ perception of non-native lexical tones","authors":"Liang Zhang , Jiaqiang Zhu , Jing Shao , Caicai Zhang","doi":"10.1016/j.specom.2024.103147","DOIUrl":"10.1016/j.specom.2024.103147","url":null,"abstract":"<div><div>Non-native lexical tone perception can be affected by listeners’ musical or linguistic experience, but it remains unclear of whether there will be combined effects and how these impacts will be modulated by different types of non-native tones. This study adopted an orthogonal design with four participant groups, namely, Mandarin-L1 monolinguals and Mandarin-L1 and Cantonese-L2 bilinguals with or without musical training, to investigate effects of bilingualism and musicianship on perception of non-native lexical tones. The closely matched four groups, each encompassing an equal number of 20 participants, attended a modified ABX discrimination task of lexical tones of Teochew, which was unknown to all participants and consists of multiple tone types of level tones, contour tones, and checked tones. The tone perceptual sensitivity index of <em>d’</em> values and response times were calculated and compared using linear mixed-effects models. Results on tone sensitivity and response time revealed that all groups were more sensitive to contour tones than level tones, indicating the effect of native phonology of Mandarin tones on non-native tone perception. Besides, as compared to monolinguals, bilinguals obtained a higher <em>d’</em> value when discriminating non-native tones, and musically trained bilinguals responded faster than their non-musician peers. It indicates that bilinguals enjoy a perceptual advantage in non-native tone perception, with musicianship further enhancing this advantage. This extends prior studies by showing that an L2 with a more intricate tone inventory than L1 could facilitate listeners’ non-native tone perception. The pedagogical implications were discussed.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"165 ","pages":"Article 103147"},"PeriodicalIF":2.4,"publicationDate":"2024-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142656235","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Evaluating the effects of continuous pitch and speech tempo modifications on perceptual speaker verification performance by familiar and unfamiliar listeners","authors":"Benjamin O’Brien , Christine Meunier , Alain Ghio","doi":"10.1016/j.specom.2024.103145","DOIUrl":"10.1016/j.specom.2024.103145","url":null,"abstract":"<div><div>A study was conducted to evaluate the effects of continuous pitch and speech tempo modifications on perceptual speaker verification performance by familiar and unfamiliar naive listeners. Speech recordings made by twelve male, native-French speakers were organised into three groups of four (two in-set, one out-of-set). Two groups of listeners participated, where one group was familiar with one in-set speaker group, while both groups were unfamiliar with the remaining in- and out-of-set speaker groups. Pitch and speech tempo were continuously modified, such that the first 75% of words spoken were modified with percentages of modification beginning at 100% and decaying linearly to 0%. Pitch modifications began at <span><math><mo>±</mo></math></span> 600 cents, while speech tempo modifications started with word durations scaled 1:2 or 3:2. Participants evaluated a series of “go/no-go” task trials, where they were presented a modified speech recording with a face and tasked to respond as quickly as possible if they judged the stimuli to be continuous. The major findings revealed listeners overcame higher percentages of modification when presented familiar speaker stimuli. Familiar listeners outperformed unfamiliar listeners when evaluating continuously modified speech tempo stimuli, however, this effect was speaker-specific for pitch modified stimuli. Contrasting effects of modification direction were also observed. The findings suggest pitch is more useful to listeners when verifying familiar and unfamiliar voices.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"165 ","pages":"Article 103145"},"PeriodicalIF":2.4,"publicationDate":"2024-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142527431","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A corpus of audio-visual recordings of linguistically balanced, Danish sentences for speech-in-noise experiments","authors":"Abigail Anne Kressner , Kirsten Maria Jensen-Rico , Johannes Kizach , Brian Kai Loong Man , Anja Kofoed Pedersen , Lars Bramsløw , Lise Bruun Hansen , Laura Winther Balling , Brent Kirkwood , Tobias May","doi":"10.1016/j.specom.2024.103141","DOIUrl":"10.1016/j.specom.2024.103141","url":null,"abstract":"<div><div>A typical speech-in-noise experiment in a research and development setting can easily contain as many as 20 conditions, or even more, and often requires at least two test points per condition. A sentence test with enough sentences to make this amount of testing possible without repetition does not yet exist in Danish. Thus, a new corpus has been developed to facilitate the creation of a sentence test that is large enough to address this need. The corpus itself is made up of audio and audio-visual recordings of 1200 linguistically balanced sentences, all of which are spoken by two female and two male talkers. The sentences were constructed using a novel, template-based method that facilitated control over both word frequency and sentence structure. The sentences were evaluated linguistically in terms of phonemic distributions, naturalness, and connotation, and thereafter, recorded, postprocessed, and rated on their audio, visual, and pronunciation qualities. This paper describes in detail the methodology employed to create and characterize this corpus.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"165 ","pages":"Article 103141"},"PeriodicalIF":2.4,"publicationDate":"2024-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142423403","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Forms, factors and functions of phonetic convergence: Editorial","authors":"Elisa Pellegrino , Volker Dellwo , Jennifer S. Pardo , Bernd Möbius","doi":"10.1016/j.specom.2024.103142","DOIUrl":"10.1016/j.specom.2024.103142","url":null,"abstract":"<div><div>This introductory article for the Special Issue on Forms, Factors and Functions of Phonetic Convergence offers a comprehensive overview of the dominant theoretical paradigms, elicitation methods, and computational approaches pertaining to phonetic convergence, and discusses the role of established factors shaping interspeakers’ acoustic adjustments. The nine papers in this collection offer new insights into the fundamental mechanisms, factors and functions behind accommodation in production and perception, and in the perception of accommodation. By integrating acoustic, articulatory and perceptual evaluations of convergence, and combining traditional experimental phonetic analysis with computational modeling, the nine papers (1) emphasize the roles of cognitive adaptability and phonetic variability as triggers for convergence, (2) reveal fundamental similarities between the mechanisms of convergence perception and speaker identification, and (3) shed light on the evolutionary link between adaptation in human and animal vocalizations.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"165 ","pages":"Article 103142"},"PeriodicalIF":2.4,"publicationDate":"2024-09-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142423406","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Feasibility of acoustic features of vowel sounds in estimating the upper airway cross sectional area during wakefulness: A pilot study","authors":"Shumit Saha , Keerthana Viswanathan , Anamika Saha , Azadeh Yadollahi","doi":"10.1016/j.specom.2024.103144","DOIUrl":"10.1016/j.specom.2024.103144","url":null,"abstract":"<div><div>Assessment of upper airway dimensions has shown great promise in understanding the pathogenesis of obstructive sleep apnea (OSA). However, the current screening system for OSA does not have an objective assessment of the upper airway. The assessment of the upper airway can accurately be performed by MRI or CT scans, which are costly and not easily accessible. Acoustic pharyngometry or Ultrasonography could be less expensive technologies, but these require trained personnel which makes these technologies not easily accessible, especially when assessing the upper airway in a clinic environment or before surgery. In this study, we aimed to investigate the utility of vowel articulation in assessing the upper airway dimension during normal breathing. To accomplish that, we measured the upper airway cross-sectional area (UA-XSA) by acoustic pharyngometry and then asked the participants to produce 5 vowels for 3 s and recorded them with a microphone. We extracted 710 acoustic features from all vowels and compared these features with UA-XSA and developed regression models to estimate the UA-XSA. Our results showed that Mel frequency cepstral coefficients (MFCC) were the most dominant features of vowels, as 7 out of 9 features were from MFCC in the main feature set. The multiple regression analysis showed that the combination of the acoustic features with the anthropometric features achieved an R<sup>2</sup> of 0.80 in estimating UA-XSA. The important advantage of acoustic analysis of vowel sounds is that it is simple and can be easily implemented in wearable devices or mobile applications. Such acoustic-based technologies can be accessible in different clinical settings such as the intensive care unit and can be used in remote areas. Thus, these results could be used to develop user-friendly applications to use the acoustic features and demographical information to estimate the UA-XSA.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"165 ","pages":"Article 103144"},"PeriodicalIF":2.4,"publicationDate":"2024-09-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142423404","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Zero-shot voice conversion based on feature disentanglement","authors":"Na Guo , Jianguo Wei , Yongwei Li , Wenhuan Lu , Jianhua Tao","doi":"10.1016/j.specom.2024.103143","DOIUrl":"10.1016/j.specom.2024.103143","url":null,"abstract":"<div><div>Voice conversion (VC) aims to convert the voice from a source speaker to a target speaker without modifying the linguistic content. Zero-shot voice conversion has attracted significant attention in the task of VC because it can achieve conversion for speakers who did not appear during the training stage. Despite the significant progress made by previous methods in zero-shot VC, there is still room for improvement in separating speaker information and content information. In this paper, we propose a zero-shot VC method based on feature disentanglement. The proposed model uses a speaker encoder for extracting speaker embeddings, introduces mixed speaker layer normalization to eliminate residual speaker information in content encoding, and employs adaptive attention weight normalization for conversion. Furthermore, dynamic convolution is introduced to improve speech content modeling while requiring a small number of parameters. The experiments demonstrate that performance of the proposed model is superior to several state-of-the-art models, achieving both high similarity with the target speaker and intelligibility. In addition, the decoding speed of our model is much higher than the existing state-of-the-art models.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"165 ","pages":"Article 103143"},"PeriodicalIF":2.4,"publicationDate":"2024-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142423405","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-modal co-learning for silent speech recognition based on ultrasound tongue images","authors":"Minghao Guo , Jianguo Wei , Ruiteng Zhang , Yu Zhao , Qiang Fang","doi":"10.1016/j.specom.2024.103140","DOIUrl":"10.1016/j.specom.2024.103140","url":null,"abstract":"<div><p>Silent speech recognition (SSR) is an essential task in human–computer interaction, aiming to recognize speech from non-acoustic modalities. A key challenge in SSR is inherent input ambiguity due to partial speech information absence in non-acoustic signals. This ambiguity leads to homophones-words with similar inputs yet different pronunciations. Current approaches address this issue either by utilizing richer additional inputs or training extra models for cross-modal embedding compensation. In this paper, we propose an effective multi-modal co-learning framework promoting the discriminative ability of silent speech representations via multi-stage training. We first construct the backbone of SSR using ultrasound tongue imaging (UTI) as the main modality and then introduce two auxiliary modalities: lip video and audio signals. Utilizing modality dropout, the model learns shared/specific features from all available streams creating a same semantic space for better generalization of the UTI representation. Given cross-modal unbalanced optimization, we highlight the importance of hyperparameter settings and modulation strategies in enabling modality-specific co-learning for SSR. Experimental results show that the modality-agnostic models with single UTI input outperform state-of-the-art modality-specific models. Confusion analysis based on phonemes/articulatory features confirms that co-learned UTI representations contain valuable information for distinguishing homophenes. Additionally, our model can perform well on two unseen testing sets, achieving cross-modal generalization for the uni-modal SSR task.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"165 ","pages":"Article 103140"},"PeriodicalIF":2.4,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142239519","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}