{"title":"End-to-end integration of speech separation and voice activity detection for low-latency diarization of telephone conversations","authors":"Giovanni Morrone , Samuele Cornell , Luca Serafini , Enrico Zovato , Alessio Brutti , Stefano Squartini","doi":"10.1016/j.specom.2024.103081","DOIUrl":"https://doi.org/10.1016/j.specom.2024.103081","url":null,"abstract":"<div><p>Recent works show that speech separation guided diarization (SSGD) is an increasingly promising direction, mainly thanks to the recent progress in speech separation. It performs diarization by first separating the speakers and then applying voice activity detection (VAD) on each separated stream. In this work we conduct an in-depth study of SSGD in the conversational telephone speech (CTS) domain, focusing mainly on low-latency streaming diarization applications. We consider three state-of-the-art speech separation (SSep) algorithms and study their performance both in online and offline scenarios, considering non-causal and causal implementations as well as continuous SSep (CSS) windowed inference. We compare different SSGD algorithms on two widely used CTS datasets: CALLHOME and Fisher Corpus (Part 1 and 2) and evaluate both separation and diarization performance. To improve performance, a novel, causal and computationally efficient leakage removal algorithm is proposed, which significantly decreases false alarms. We also explore, for the first time, fully end-to-end SSGD integration between SSep and VAD modules. Crucially, this enables fine-tuning on real-world data for which oracle speakers sources are not available. In particular, our best model achieves 8.8% DER on CALLHOME, which outperforms the current state-of-the-art end-to-end neural diarization model, despite being trained on an order of magnitude less data and having significantly lower latency, i.e., 0.1 vs. 1 s. 
Finally, we also show that the separated signals can be readily used also for automatic speech recognition, reaching performance close to using oracle sources in some configurations.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"161 ","pages":"Article 103081"},"PeriodicalIF":3.2,"publicationDate":"2024-05-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141078094","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
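The separate-then-VAD structure of SSGD described in this abstract can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the neural separator is stubbed out (the paper uses trained SSep models), and the VAD here is a simple energy threshold rather than a learned module, with hypothetical frame-length and threshold values.

```python
# Sketch of the speech-separation-guided diarization (SSGD) pipeline:
# separate speakers, run VAD on each separated stream, and emit
# per-speaker speech segments. A real system would replace energy_vad
# with a neural VAD and feed streams from a neural separator.

def energy_vad(stream, frame_len=160, threshold=0.01):
    """Per-frame speech/non-speech decisions for one separated stream."""
    decisions = []
    for start in range(0, len(stream) - frame_len + 1, frame_len):
        frame = stream[start:start + frame_len]
        energy = sum(s * s for s in frame) / frame_len
        decisions.append(energy > threshold)
    return decisions

def frames_to_segments(decisions, frame_len=160, rate=8000):
    """Merge consecutive speech frames into (onset, offset) pairs in seconds."""
    segments, onset = [], None
    for i, speech in enumerate(decisions):
        t = i * frame_len / rate
        if speech and onset is None:
            onset = t
        elif not speech and onset is not None:
            segments.append((onset, t))
            onset = None
    if onset is not None:
        segments.append((onset, len(decisions) * frame_len / rate))
    return segments

def ssgd(separated_streams, rate=8000):
    """Diarization output: one segment list per separated speaker."""
    return [frames_to_segments(energy_vad(s), rate=rate) for s in separated_streams]
```

Because each speaker gets a dedicated stream, overlapped speech needs no special handling downstream; the leakage-removal step proposed in the paper would sit between separation and VAD to suppress cross-channel residue that otherwise causes false alarms.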
{"title":"Visual-articulatory cues facilitate children with CIs to better perceive Mandarin tones in sentences","authors":"Ping Tang, Shanpeng Li, Yanan Shen, Qianxi Yu, Yan Feng","doi":"10.1016/j.specom.2024.103084","DOIUrl":"10.1016/j.specom.2024.103084","url":null,"abstract":"<div><p>Children with cochlear implants (CIs) face challenges in tonal perception under noise. Nevertheless, our previous research demonstrated that seeing visual-articulatory cues (speakers’ facial/head movements) benefited these children to perceive isolated tones better, particularly in noisy environments, with those implanted earlier gaining more benefits. However, tones in daily speech typically occur in sentence contexts where visual cues are largely reduced compared to those in isolated contexts. It was thus unclear if visual benefits on tonal perception still hold in these challenging sentence contexts. Therefore, this study tested 64 children with CIs and 64 age-matched NH children. Target tones in sentence-medial position were presented in audio-only (AO) or audiovisual (AV) conditions, in quiet and noisy environments. Children selected the target tone using a picture-point task. The results showed that, while NH children did not show any perception difference between AO and AV conditions, children with CIs significantly improved their perceptual accuracy from AO to AV conditions. The degree of improvement was negatively correlated with their implantation ages. 
Therefore, children with CIs were able to use visual-articulatory cues to facilitate their tonal perception even in sentence contexts, and earlier auditory experience might be important in shaping this ability.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"160 ","pages":"Article 103084"},"PeriodicalIF":3.2,"publicationDate":"2024-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141028923","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The prosody of theme, rheme and focus in Egyptian Arabic: A quantitative investigation of tunes, configurations and speaker variability","authors":"Dina El Zarka , Anneliese Kelterer , Michele Gubian , Barbara Schuppler","doi":"10.1016/j.specom.2024.103082","DOIUrl":"10.1016/j.specom.2024.103082","url":null,"abstract":"<div><p>This paper investigates the prosody of sentences elicited in three Information Structure (IS) conditions: all-new, theme-rheme and rhematic focus-background. The sentences were produced by 18 speakers of Egyptian Arabic (EA). This is the first quantitative study to provide a comprehensive analysis of holistic f0 contours (by means of GAMM) and configurations of f0, duration and intensity (by means of FPCA) associated with the three IS conditions, both across and within speakers. A significant difference between focus-background and the other information structure conditions was found, but also strong inter-speaker variation in terms of strategies and the degree to which these strategies were applied. The results suggest that post-focus register lowering and the duration of the stressed syllables of the focused and the utterance-final word are more consistent cues to focus than a higher peak of the focus accent. In addition, some independence of duration and intensity from f0 could be identified. These results thus support the assumption that, when focus is marked prosodically in EA, it is marked by prominence. 
Nevertheless, the fact that a considerable number of EA speakers did not apply prosodic marking and the fact that prosodic focus marking was gradient rather than categorical suggest that EA does not have a fully conventionalized prosodic focus construction.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"160 ","pages":"Article 103082"},"PeriodicalIF":3.2,"publicationDate":"2024-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0167639324000542/pdfft?md5=dcb4ae8365c4f0e84a5827d3ae202551&pid=1-s2.0-S0167639324000542-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141035839","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Factorized and progressive knowledge distillation for CTC-based ASR models","authors":"Sanli Tian , Zehan Li , Zhaobiao Lyv , Gaofeng Cheng , Qing Xiao , Ta Li , Qingwei Zhao","doi":"10.1016/j.specom.2024.103071","DOIUrl":"https://doi.org/10.1016/j.specom.2024.103071","url":null,"abstract":"<div><p>Knowledge distillation (KD) is a popular model compression method to improve the performance of lightweight models by transferring knowledge from a teacher model to a student model. However, applying KD to connectionist temporal classification (CTC) ASR model is challenging due to its peaky posterior property. In this paper, we propose to address this issue by treating non-blank and blank frames differently for two main reasons. First, the non-blank frames in the teacher model’s posterior matrix and hidden representations provide more acoustic and linguistic information than the blank frames, but the frame number of non-blank frames only accounts for a small fraction of all frames, leading to a severe learning imbalance problem. Second, the non-blank tokens in the teacher’s blank-frame posteriors exhibit irregular probability distributions, negatively impacting the student model’s learning. Thus, we propose to factorize the distillation of non-blank and blank frames and further combine them into a progressive KD framework, which contains three incremental stages to facilitate the student model gradually building up its knowledge. The first stage involves a simple binary classification KD task, in which the student learns to distinguish between non-blank and blank frames, as the two types of frames are learned separately in subsequent stages. The second stage is a factorized representation-based KD, in which hidden representations are divided into non-blank and blank frames so that both can be distilled in a balanced manner. 
In the third stage, the student learns from the teacher’s posterior matrix through our proposed method, factorized KL-divergence (FKL), which performs different operation on blank and non-blank frame posteriors to alleviate the imbalance issue and reduce the influence of irregular probability distributions. Compared to the baseline, our proposed method achieves 22.5% relative CER reduction on the Aishell-1 dataset, 23.0% relative WER reduction on the Tedlium-2 dataset, and 17.6% relative WER reduction on the LibriSpeech dataset. To show the generalization of our method, we also evaluate our method on the hybrid CTC/Attention architecture as well as on scenarios with cross-model topology KD.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"160 ","pages":"Article 103071"},"PeriodicalIF":3.2,"publicationDate":"2024-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140879835","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
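The core idea of distilling blank and non-blank frames separately can be sketched as follows. This is a simplified stand-in for the paper's FKL loss, not its actual formulation: frames are grouped by whether the teacher's top token is the blank symbol, each group's KL divergence is averaged independently, and the group weights (`w_blank`, `w_nonblank`) are hypothetical.

```python
import math

def kl_div(p, q, eps=1e-8):
    """KL(p || q) between two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def factorized_kd_loss(teacher, student, blank_id=0, w_blank=0.5, w_nonblank=1.0):
    """Frame-wise KD loss with blank and non-blank frames distilled separately.

    A frame counts as 'blank' when the teacher's top-scoring token is the
    blank symbol. Averaging the two groups independently keeps the many
    blank frames from swamping the few informative non-blank frames."""
    blank_losses, nonblank_losses = [], []
    for t_post, s_post in zip(teacher, student):
        is_blank = max(range(len(t_post)), key=t_post.__getitem__) == blank_id
        (blank_losses if is_blank else nonblank_losses).append(kl_div(t_post, s_post))
    loss = 0.0
    if blank_losses:
        loss += w_blank * sum(blank_losses) / len(blank_losses)
    if nonblank_losses:
        loss += w_nonblank * sum(nonblank_losses) / len(nonblank_losses)
    return loss
```

In a real training loop the posteriors would be softmax outputs over the vocabulary per frame, and this loss would be combined with the CTC objective.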
{"title":"Optimization-based planning of speech articulation using general Tau Theory","authors":"Benjamin Elie , Juraj Šimko , Alice Turk","doi":"10.1016/j.specom.2024.103083","DOIUrl":"https://doi.org/10.1016/j.specom.2024.103083","url":null,"abstract":"<div><p>This paper presents a model of speech articulation planning and generation based on General Tau Theory and Optimal Control Theory. Because General Tau Theory assumes that articulatory targets are always reached, the model accounts for speech variation via context-dependent articulatory targets. Targets are chosen via the optimization of a composite objective function. This function models three different task requirements: maximal intelligibility, minimal articulatory effort and minimal utterance duration. The paper shows that systematic phonetic variability can be reproduced by adjusting the weights assigned to each task requirement. Weights can be adjusted globally to simulate different speech styles, and can be adjusted locally to simulate different levels of prosodic prominence. The solution of the optimization procedure contains Tau equation parameter values for each articulatory movement, namely position of the articulator at the movement offset, movement duration, and a parameter which relates to the shape of the movement’s velocity profile. The paper presents simulations which illustrate the ability of the model to predict or reproduce several well-known characteristics of speech. 
These phenomena include close-to-symmetric velocity profiles for articulatory movement, variation related to speech rate, centralization of unstressed vowels, lengthening of stressed vowels, lenition of unstressed lingual stop consonants, and coarticulation of stop consonants.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"160 ","pages":"Article 103083"},"PeriodicalIF":3.2,"publicationDate":"2024-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0167639324000554/pdfft?md5=9244f2762d9cdb76bf74cf04a57a092e&pid=1-s2.0-S0167639324000554-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140948784","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
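A concrete sense of what a tau-guided movement looks like can be had from one closed form commonly used in the tau-guidance literature (not necessarily the exact parameterization of this paper): a gap x closed under tau-G guidance follows x(t) = x0 · (1 − (t/T)²)^(1/k), where T is the movement duration and k shapes the velocity profile. The gap always reaches zero at t = T, which matches the paper's assumption that targets are always reached.

```python
def tau_guided_gap(x0, T, k, n=5):
    """Sample the tau-G-guided gap closure x(t) = x0 * (1 - (t/T)**2)**(1/k)
    at n+1 evenly spaced times in [0, T]. x0: initial gap, T: movement
    duration, k: velocity-profile shape parameter."""
    times = [i * T / n for i in range(n + 1)]
    return [x0 * (1.0 - (t / T) ** 2) ** (1.0 / k) for t in times]
```

Whatever x0, T and k the optimizer picks, the trajectory starts at the full gap and ends exactly at the target, so variability lives entirely in the chosen parameters rather than in undershoot.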
{"title":"Chinese speech intelligibility and speech intelligibility index for the elderly","authors":"Jiazhong Zeng , Jianxin Peng , Shuyin Xiang","doi":"10.1016/j.specom.2024.103072","DOIUrl":"https://doi.org/10.1016/j.specom.2024.103072","url":null,"abstract":"<div><p>The speech intelligibility index (SII) and speech transmission index (STI) are widely accepted objective metrics for assessing speech intelligibility. In previous work, the relationship between STI and Chinese speech intelligibility (CSI) scores was studied. In this paper, the relationship between SII and CSI scores in rooms for the elderly aged 60–69 and over 70 is investigated by using auralization method under different background noise levels (40dBA and 55dBA) and different reverberation times. The results show that SII has good correlation with CSI score of the elderly. To get the same CSI score as the young adults, the elderly need a larger SII value, and the value increases with the increase of the age for the elderly. Since hearing loss of the elderly is considered in the calculation of SII, the difference in the required SII between the elderly and young is less than that of the required STI under the same CSI score condition. This indicates that SII is a more consistent evaluation criterion for different ages.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"160 ","pages":"Article 103072"},"PeriodicalIF":3.2,"publicationDate":"2024-04-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140638630","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Combined approach to dysarthric speaker verification using data augmentation and feature fusion","authors":"Shinimol Salim , Syed Shahnawazuddin , Waquar Ahmad","doi":"10.1016/j.specom.2024.103070","DOIUrl":"https://doi.org/10.1016/j.specom.2024.103070","url":null,"abstract":"<div><p>In this study, the challenges of adapting automatic speaker verification (ASV) systems to accommodate individuals with dysarthria, a speech disorder affecting intelligibility and articulation, are addressed. The scarcity of dysarthric speech data presents a significant obstacle in the development of an effective ASV system. To mitigate the detrimental effects of data paucity, an out-of-domain data augmentation approach was employed based on the observation that dysarthric speech often exhibits longer phoneme duration. Motivated by this observation, the duration of healthy speech data was modified with various stretching factors and then pooled into training, resulting in a significant reduction in the error rate. In addition to analyzing average phoneme duration, another analysis revealed that dysarthric speech contains crucial high-frequency spectral information. However, Mel-frequency cepstral coefficients (MFCC) are inherently designed to down-sample spectral information in the higher-frequency regions, and the same is true for Mel-filterbank features. To address this shortcoming, Linear-filterbank cepstral coefficients (LFCC) were used in combination with MFCC features. While MFCC effectively captures certain aspects of dysarthric speech, LFCC complements this by capturing high-frequency details essential for accurate dysarthric speaker verification. This proposed feature fusion effectively minimizes spectral information loss, further reducing error rates. To support the significance of combination of MFCC and LFCC features in an automatic speaker verification system for speakers with dysarthria, comprehensive experimentation was conducted. 
The fusion of MFCC and LFCC features was compared with several other front-end acoustic features, such as Mel-filterbank features, linear filterbank features, wavelet filterbank features, linear prediction cepstral coefficients (LPCC), frequency domain LPCC, and constant Q cepstral coefficients (CQCC). The approaches were evaluated using both <em>i</em>-vector and <em>x</em>-vector-based representation, comparing systems developed using MFCC and LFCC features individually and in combination. The experimental results presented in this paper demonstrate substantial improvements, with a 25.78% reduction in equal error rate (EER) for <em>i</em>-vector models and a 23.66% reduction in EER for <em>x</em>-vector models when compared to the baseline ASV system. Additionally, the effect of feature concatenation with variation in dysarthria severity levels (low, medium, and high) was studied, and the proposed approach was found to be highly effective in those cases as well.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"160 ","pages":"Article 103070"},"PeriodicalIF":3.2,"publicationDate":"2024-04-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140555266","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
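The two ingredients of the approach, duration-based augmentation and frame-level feature fusion, can be sketched as below. This is an illustration under stated assumptions, not the paper's pipeline: the stretch is naive linear interpolation rather than a proper time-scale modification algorithm, and the stretching factors are hypothetical.

```python
def stretch(signal, factor):
    """Naive time-stretch by linear interpolation: lengthens the signal by
    `factor`, mimicking the longer phoneme durations of dysarthric speech.
    (A production system would use a proper tempo-scaling method.)"""
    n_out = int(len(signal) * factor)
    out = []
    for i in range(n_out):
        pos = i / factor
        lo = int(pos)
        hi = min(lo + 1, len(signal) - 1)
        frac = pos - lo
        out.append(signal[lo] * (1 - frac) + signal[hi] * frac)
    return out

def augment(healthy_utterances, factors=(1.1, 1.3, 1.5)):
    """Pool duration-stretched copies of healthy speech into the training set."""
    pool = list(healthy_utterances)
    for utt in healthy_utterances:
        for f in factors:
            pool.append(stretch(utt, f))
    return pool

def fuse_features(mfcc_frames, lfcc_frames):
    """Frame-level fusion: concatenate the MFCC and LFCC vectors of each frame,
    so low-frequency (Mel) and high-frequency (linear) detail coexist."""
    assert len(mfcc_frames) == len(lfcc_frames)
    return [m + l for m, l in zip(mfcc_frames, lfcc_frames)]
```

The fused frames would then feed the usual i-vector or x-vector extractor unchanged; only the front-end dimensionality grows.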
{"title":"An ensemble technique to predict Parkinson's disease using machine learning algorithms","authors":"Nutan Singh, Priyanka Tripathi","doi":"10.1016/j.specom.2024.103067","DOIUrl":"https://doi.org/10.1016/j.specom.2024.103067","url":null,"abstract":"<div><p>Parkinson's Disease (PD) is a progressive neurodegenerative disorder affecting motor and non-motor symptoms. Its symptoms develop slowly, making early identification difficult. Machine learning has a significant potential to predict Parkinson's disease on features hidden in voice data. This work aimed to identify the most relevant features from a high-dimensional dataset, which helps accurately classify Parkinson's Disease with less computation time. Three individual datasets with various medical features based on voice have been analyzed in this work. An Ensemble Feature Selection Algorithm (EFSA) technique based on filter, wrapper, and embedding algorithms that pick highly relevant features for identifying Parkinson's Disease is proposed, and the same has been validated on three different datasets based on voice. These techniques can shorten training time to improve model accuracy and minimize overfitting. We utilized different ML models such as K-Nearest Neighbors (KNN), Random Forest, Decision Tree, Support Vector Machine (SVM), Bagging Classifier, Multi-Layer Perceptron (MLP) Classifier, and Gradient Boosting. Each of these models was fine-tuned to ensure optimal performance within our specific context. Moreover, in addition to these established classifiers, we proposed an ensemble classifier is found on a high optimal majority of the votes. Dataset-I achieves classification accuracy with 97.6 %, F<sub>1</sub>-score 97.9 %, precision with 98 % and recall with 98 %. Dataset-II achieves classification accuracy 90.2 %, F<sub>1</sub>-score 90.2 %, precision 90.2 %, and recall 90.5 %. Dataset-III achieves 83.3 % accuracy, F<sub>1</sub>-score 83.3 %, precision 83.5 % and recall 83.3 %. 
These results have been taken using 13 out of 23, 45 out of 754, and 17 out of 46 features from respective datasets. The proposed EFSA model has performed with higher accuracy and is more efficient than other models for each dataset.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"159 ","pages":"Article 103067"},"PeriodicalIF":3.2,"publicationDate":"2024-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140547363","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
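The two ensembling ideas in this abstract, voting across feature selectors and voting across classifiers, reduce to simple counting. The sketch below shows both mechanics in miniature; the feature names and the vote threshold are hypothetical, and a real EFSA would obtain the candidate sets from actual filter, wrapper, and embedded selectors.

```python
from collections import Counter

def ensemble_feature_votes(selected_sets, min_votes=2):
    """Keep features chosen by at least `min_votes` of the selectors
    (e.g. filter, wrapper, embedded) -- the counting step of an
    ensemble feature-selection scheme."""
    votes = Counter(f for s in selected_sets for f in set(s))
    return sorted(f for f, v in votes.items() if v >= min_votes)

def majority_vote(predictions):
    """Hard-voting ensemble: per sample, the most common class label
    across classifiers wins. `predictions` is one label list per model."""
    return [Counter(sample).most_common(1)[0][0] for sample in zip(*predictions)]
```

With, say, three voice-feature selectors nominating jitter, shimmer, and f0 measures, only features nominated by a majority survive, which is what trims 754 candidate features down to a few dozen.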
{"title":"A multimodal model for predicting feedback position and type during conversation","authors":"Auriane Boudin , Roxane Bertrand , Stéphane Rauzy , Magalie Ochs , Philippe Blache","doi":"10.1016/j.specom.2024.103066","DOIUrl":"https://doi.org/10.1016/j.specom.2024.103066","url":null,"abstract":"<div><p>This study investigates conversational feedback, that is, a listener's reaction in response to a speaker, a phenomenon which occurs in all natural interactions. Feedback depends on the main speaker's productions and in return supports the elaboration of the interaction. As a consequence, feedback production has a direct impact on the quality of the interaction.</p><p>This paper examines all types of feedback, from generic to specific feedback, the latter of which has received less attention in the literature. We also present a fine-grained labeling system introducing two sub-types of specific feedback: <em>positive/negative</em> and <em>given/new</em>. Following a literature review on linguistic and machine learning perspectives highlighting the main issues in feedback prediction, we present a model based on a set of multimodal features which predicts the possible position of feedback and its type. This computational model makes it possible to precisely identify the different features in the speaker's production (morpho-syntactic, prosodic and mimo-gestural) which play a role in triggering feedback from the listener; the model also evaluates their relative importance.</p><p>The main contribution of this study is twofold: we sought to improve 1/ the model's performance in comparison with other approaches relying on a small set of features, and 2/ the model's interpretability, in particular by investigating feature importance. 
By integrating all the different modalities as well as high-level features, our model is uniquely positioned to be applied to French corpora.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"159 ","pages":"Article 103066"},"PeriodicalIF":3.2,"publicationDate":"2024-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0167639324000384/pdfft?md5=d3bb6a1d05cfbf539d30e718f252c2d8&pid=1-s2.0-S0167639324000384-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140331131","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Speech intelligibility prediction using generalized ESTOI with fine-tuned parameters","authors":"Szymon Drgas","doi":"10.1016/j.specom.2024.103068","DOIUrl":"https://doi.org/10.1016/j.specom.2024.103068","url":null,"abstract":"<div><p>In this article, a lightweight and interpretable speech intelligibility prediction network is proposed. It is based on the ESTOI metric with several extensions: learned modulation filterbank, temporal attention, and taking into account robustness of a given reference recording. The proposed network is differentiable, and therefore it can be applied as a loss function in speech enhancement systems. The method was evaluated using the Clarity Prediction Challenge dataset. Compared to MB-STOI, the best of the systems proposed in this paper reduced RMSE from 28.01 to 21.33. It also outperformed best performing systems from the Clarity Challenge, while its training does not require additional labels like speech enhancement system and talker. It also has small memory and requirements, therefore, it can be potentially used as a loss function to train speech enhancement system. As it would consume less resources, the saved ones can be used for a larger speech enhancement neural network.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"159 ","pages":"Article 103068"},"PeriodicalIF":3.2,"publicationDate":"2024-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140540077","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}