Speech Communication: Latest Articles

Combined approach to dysarthric speaker verification using data augmentation and feature fusion
IF 3.2 | CAS Tier 3 | Computer Science
Speech Communication Pub Date: 2024-04-06 DOI: 10.1016/j.specom.2024.103070
Shinimol Salim, Syed Shahnawazuddin, Waquar Ahmad
Abstract: In this study, the challenges of adapting automatic speaker verification (ASV) systems to accommodate individuals with dysarthria, a speech disorder affecting intelligibility and articulation, are addressed. The scarcity of dysarthric speech data presents a significant obstacle to developing an effective ASV system. To mitigate the detrimental effects of data paucity, an out-of-domain data augmentation approach was employed, based on the observation that dysarthric speech often exhibits longer phoneme durations. Motivated by this observation, the duration of healthy speech data was modified with various stretching factors and then pooled into training, resulting in a significant reduction in the error rate. In addition to analyzing average phoneme duration, a further analysis revealed that dysarthric speech contains crucial high-frequency spectral information. However, Mel-frequency cepstral coefficients (MFCC) are inherently designed to down-sample spectral information in the higher-frequency regions, and the same is true of Mel-filterbank features. To address this shortcoming, linear-filterbank cepstral coefficients (LFCC) were used in combination with MFCC features. While MFCC effectively captures certain aspects of dysarthric speech, LFCC complements it by capturing the high-frequency details essential for accurate dysarthric speaker verification. The proposed feature fusion effectively minimizes spectral information loss, further reducing error rates. To demonstrate the significance of combining MFCC and LFCC features in an ASV system for speakers with dysarthria, comprehensive experiments were conducted. The fusion of MFCC and LFCC features was compared with several other front-end acoustic features, such as Mel-filterbank features, linear-filterbank features, wavelet-filterbank features, linear prediction cepstral coefficients (LPCC), frequency-domain LPCC, and constant-Q cepstral coefficients (CQCC). The approaches were evaluated using both i-vector and x-vector based representations, comparing systems developed using MFCC and LFCC features individually and in combination. The experimental results demonstrate substantial improvements, with a 25.78% reduction in equal error rate (EER) for i-vector models and a 23.66% reduction for x-vector models compared to the baseline ASV system. Additionally, the effect of feature concatenation across dysarthria severity levels (low, medium, and high) was studied, and the proposed approach was found to be highly effective in those cases as well.
Citations: 0
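The duration-modification augmentation described in the abstract can be sketched as follows. This is an illustrative interpolation-based stretch, not the paper's exact procedure; the abstract does not specify the stretching algorithm, and a pitch-preserving method (e.g. WSOLA or a phase vocoder) would be a more faithful choice in practice:

```python
import numpy as np

def stretch_duration(signal, factor):
    """Stretch a waveform to `factor` times its original duration by
    linear interpolation. A crude stand-in for a pitch-preserving
    time-stretch; it mimics the longer phoneme durations observed in
    dysarthric speech by slowing healthy speech down."""
    n_out = int(round(len(signal) * factor))
    src_positions = np.linspace(0, len(signal) - 1, num=n_out)
    return np.interp(src_positions, np.arange(len(signal)), signal)

# Pool stretched copies of healthy speech into the training data.
x = np.sin(2 * np.pi * 100 * np.arange(16000) / 16000)  # 1 s dummy signal
augmented = [stretch_duration(x, f) for f in (1.1, 1.2, 1.3)]
```

The stretching factors above (1.1 to 1.3) are placeholders; the paper reports sweeping "various stretching factors" without fixing a set here.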
An ensemble technique to predict Parkinson's disease using machine learning algorithms
IF 3.2 | CAS Tier 3 | Computer Science
Speech Communication Pub Date: 2024-04-01 DOI: 10.1016/j.specom.2024.103067
Nutan Singh, Priyanka Tripathi
Abstract: Parkinson's Disease (PD) is a progressive neurodegenerative disorder with both motor and non-motor symptoms. Its symptoms develop slowly, making early identification difficult. Machine learning has significant potential to predict Parkinson's disease from features hidden in voice data. This work aimed to identify the most relevant features from a high-dimensional dataset, which helps classify Parkinson's Disease accurately with less computation time. Three individual voice-based datasets with various medical features were analyzed in this work. An Ensemble Feature Selection Algorithm (EFSA) based on filter, wrapper, and embedding algorithms, which picks features highly relevant for identifying Parkinson's Disease, is proposed and validated on the three voice-based datasets. These techniques can shorten training time, improve model accuracy, and minimize overfitting. We utilized different ML models such as K-Nearest Neighbors (KNN), Random Forest, Decision Tree, Support Vector Machine (SVM), Bagging Classifier, Multi-Layer Perceptron (MLP) Classifier, and Gradient Boosting. Each of these models was fine-tuned to ensure optimal performance within our specific context. In addition to these established classifiers, an ensemble classifier based on a majority vote over their predictions is proposed. Dataset-I achieves 97.6% classification accuracy, 97.9% F1-score, 98% precision, and 98% recall. Dataset-II achieves 90.2% classification accuracy, 90.2% F1-score, 90.2% precision, and 90.5% recall. Dataset-III achieves 83.3% accuracy, 83.3% F1-score, 83.5% precision, and 83.3% recall. These results were obtained using 13 of 23, 45 of 754, and 17 of 46 features from the respective datasets. The proposed EFSA model performs with higher accuracy and is more efficient than the other models on each dataset.
Citations: 0
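The majority-vote combination at the heart of such an ensemble can be sketched as below; a minimal hard-voting rule applied to the predictions of already-trained classifiers (the vote matrix here is a toy placeholder, not the paper's classifiers or data):

```python
import numpy as np

def majority_vote(predictions):
    """Hard majority vote: `predictions` has shape
    (n_classifiers, n_samples) with integer class labels; for each
    sample, the label receiving the most votes wins."""
    preds = np.asarray(predictions)
    return np.array([np.bincount(col).argmax() for col in preds.T])

# Votes from three hypothetical classifiers (e.g. KNN, SVM, Random Forest).
votes = [[1, 0, 1, 1],
         [1, 1, 0, 1],
         [0, 1, 1, 1]]
print(majority_vote(votes))  # → [1 1 1 1]
```

With scikit-learn this role is played by `VotingClassifier(voting="hard")`; the hand-rolled version above just makes the counting explicit.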
A multimodal model for predicting feedback position and type during conversation
IF 3.2 | CAS Tier 3 | Computer Science
Speech Communication Pub Date: 2024-04-01 DOI: 10.1016/j.specom.2024.103066
Auriane Boudin, Roxane Bertrand, Stéphane Rauzy, Magalie Ochs, Philippe Blache
Abstract: This study investigates conversational feedback, that is, a listener's reaction in response to a speaker, a phenomenon which occurs in all natural interactions. Feedback depends on the main speaker's productions and in return supports the elaboration of the interaction. As a consequence, feedback production has a direct impact on the quality of the interaction.

This paper examines all types of feedback, from generic to specific feedback, the latter of which has received less attention in the literature. We also present a fine-grained labeling system introducing two sub-types of specific feedback: positive/negative and given/new. Following a literature review on linguistic and machine learning perspectives highlighting the main issues in feedback prediction, we present a model based on a set of multimodal features which predicts the possible position of feedback and its type. This computational model makes it possible to precisely identify the different features in the speaker's production (morpho-syntactic, prosodic, and mimo-gestural) which play a role in triggering feedback from the listener; the model also evaluates their relative importance.

The main contribution of this study is twofold: we sought to improve (1) the model's performance in comparison with other approaches relying on a small set of features, and (2) the model's interpretability, in particular by investigating feature importance. By integrating all the different modalities as well as high-level features, our model is uniquely positioned to be applied to French corpora.
Citations: 0
Speech intelligibility prediction using generalized ESTOI with fine-tuned parameters
IF 3.2 | CAS Tier 3 | Computer Science
Speech Communication Pub Date: 2024-04-01 DOI: 10.1016/j.specom.2024.103068
Szymon Drgas
Abstract: In this article, a lightweight and interpretable speech intelligibility prediction network is proposed. It is based on the ESTOI metric with several extensions: a learned modulation filterbank, temporal attention, and accounting for the robustness of a given reference recording. The proposed network is differentiable and can therefore be applied as a loss function in speech enhancement systems. The method was evaluated using the Clarity Prediction Challenge dataset. Compared to MB-STOI, the best of the systems proposed in this paper reduced RMSE from 28.01 to 21.33. It also outperformed the best-performing systems from the Clarity Challenge, while its training does not require additional labels such as the speech enhancement system or talker identity. It also has small memory and computational requirements and can therefore potentially be used as a loss function for training a speech enhancement system; as it consumes fewer resources, the savings can be devoted to a larger speech enhancement neural network.
Citations: 0
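The ESTOI family of metrics scores short-time spectrogram segments by normalized correlation between clean and degraded representations. A simplified sketch of one such segment score follows; the full ESTOI also normalizes along the time axis, and the paper's learned filterbank, attention, and robustness terms are omitted here:

```python
import numpy as np

def segment_score(clean_seg, degraded_seg):
    """Average row-wise correlation between a clean and a degraded
    spectrogram segment (bands x frames). Each spectral row is
    mean-subtracted and unit-normalized, so identical segments
    score 1.0 and sign-flipped segments score -1.0."""
    def row_normalize(m):
        m = m - m.mean(axis=1, keepdims=True)
        return m / (np.linalg.norm(m, axis=1, keepdims=True) + 1e-12)
    c = row_normalize(clean_seg)
    d = row_normalize(degraded_seg)
    return float(np.mean(np.sum(c * d, axis=1)))

rng = np.random.default_rng(0)
seg = rng.standard_normal((15, 30))          # 15 bands, 30 frames
print(round(segment_score(seg, seg), 3))     # → 1.0
```

Because every step is differentiable, a score like this can serve as a training loss, which is what makes the paper's differentiable formulation usable inside a speech enhancement system.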
Automatic speaker and age identification of children from raw speech using sincNet over ERB scale
IF 3.2 | CAS Tier 3 | Computer Science
Speech Communication Pub Date: 2024-04-01 DOI: 10.1016/j.specom.2024.103069
Kodali Radha, Mohan Bansal, Ram Bilas Pachori
Abstract: This paper presents the newly developed non-native children's English speech (NNCES) corpus and reports findings on automatic speaker and age recognition from raw speech. Convolutional neural networks (CNN), which can learn low-level speech representations, can be fed directly with raw speech signals instead of traditional hand-crafted features. However, the filters learned by standard CNNs tend to be noisy because all elements of each filter are learned freely. In contrast, sincNet generates more meaningful filters simply by replacing the first convolutional layer of a standard CNN with a sinc-layer. The low and high cutoff frequencies of its rectangular band-pass filters are the only learnable parameters, which allows sincNet to extract significant speech cues from the speaker, such as pitch and formants. In this work, the sincNet model is substantially modified by switching from the baseline Mel-scale initialization to an equivalent rectangular bandwidth (ERB) initialization, which has the added benefit of allocating additional filters to the lower region of the spectrum. The novel sincNet model also proves well suited to identifying the age of children. On both read and spontaneous speech tasks, the investigated models outperform the baselines on speaker identification and on gender-independent and gender-dependent age-group identification of children, with varying relative improvements in accuracy.
Citations: 0
The effect of musical expertise on whistled vowel identification
IF 3.2 | CAS Tier 3 | Computer Science
Speech Communication Pub Date: 2024-03-17 DOI: 10.1016/j.specom.2024.103058
Anaïs Tran Ngoc, Julien Meyer, Fanny Meunier
Abstract: In this paper, we examine the impact of musical experience on whistled vowel categorization by native French speakers. Whistled speech, a natural yet modified speech type, augments speech amplitude while transposing the signal to a range of fairly high frequencies, i.e., 1 to 4 kHz. Whistled vowels are simple pitches of different heights depending on vowel position, and generally represent the most stable part of the signal, just as in modal speech. They are modulated by consonant coarticulation, resulting in characteristic pitch movements. This change in speech mode can liken the speech signal to musical notes and their modulations; however, the mechanisms used to categorize whistled phonemes rely on abstract phonological knowledge and representations. Here we explore the impact of musical expertise on this process by focusing on four whistled vowels (/i, e, a, o/) that have been used in previous experiments with non-musicians. We also included inter-speaker production variation, adding variability to the vowel pitches. Our results showed that all participants categorized whistled vowels well above chance, with musicians showing advantages for the middle whistled vowels (/a/ and /e/) as well as for the lower whistled vowel /o/. Whistler variability also affected musicians more than non-musicians and reduced their advantage, notably for the vowels /e/ and /o/. However, we found no overall training advantage for musicians across the whole experiment, but rather training effects for /a/ and /e/ when all participants were taken into account. This suggests that although musical experience may help structure the vowel hierarchy when the whistler has a larger range, this advantage does not generalize when listening to another whistler. Thus, the transfer of musical knowledge in this task only influences certain aspects of speech perception.
Citations: 0
Symmetric and asymmetric Gaussian weighted linear prediction for voice inverse filtering
IF 3.2 | CAS Tier 3 | Computer Science
Speech Communication Pub Date: 2024-03-06 DOI: 10.1016/j.specom.2024.103057
I.A. Zalazar, G.A. Alzamendi, G. Schlotthauer
Abstract: Weighted linear prediction (WLP) has demonstrated its significance in voice inverse filtering, contributing to enhanced methods for estimating both the vocal tract filter and the glottal source. WLP provides a mechanism for down-weighting, in the linear prediction model, the voice samples that degrade vocal tract filter estimation, particularly those around glottal closure instants (GCIs). This article studies the Gaussian weighted linear prediction (GLP) strategy, which employs a Gaussian attenuation window centered at the GCIs to reduce their contribution to the WLP analysis. The Gaussian attenuation is revisited, and a parameterization of the window that adjusts to the typical variability in voice periodicity is introduced. In addition, an asymmetric Gaussian window is proposed to diminish the relevance of the voice samples preceding GCIs in the WLP model, thus providing a quasi-closed-phase inverse filtering method. The symmetric and asymmetric GLP methods for glottal source estimation are characterized on synthetic and natural phonation data, yielding a set of optimal parameters for the Gaussian attenuation windows. The results show that the proposed asymmetric attenuation improves voice inverse filtering with respect to the symmetric GLP method. Comparisons with other state-of-the-art techniques suggest that the proposed GLP approaches are competitive, falling slightly short in performance only when contrasted with the well-known quasi-closed-phase inverse filtering analysis. The simplicity of implementing the attenuation windows, coupled with their robust performance, positions the proposed GLP methods as two attractive and straightforward voice inverse filtering techniques for practical application.
Citations: 0
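A minimal sketch of weighted linear prediction with a Gaussian attenuation window follows. The window width, floor value, and least-squares formulation below are illustrative choices, not the paper's exact parameterization (which adapts the window to voice periodicity and adds an asymmetric variant):

```python
import numpy as np

def gaussian_attenuation(n_samples, gcis, width=40, floor=0.1):
    """Per-sample WLP weights: dip toward `floor` in a Gaussian
    neighborhood of each glottal closure instant (GCI), 1 elsewhere,
    so samples near GCIs contribute little to the prediction fit."""
    t = np.arange(n_samples)
    w = np.ones(n_samples)
    for g in gcis:
        w = np.minimum(w, 1 - (1 - floor) * np.exp(-0.5 * ((t - g) / width) ** 2))
    return w

def wlp_coefficients(signal, order, weights):
    """Solve the weighted normal equations: minimize
    sum_n weights[n] * (s[n] - sum_k a[k] * s[n-k])^2 over a."""
    n = len(signal)
    X = np.column_stack([signal[order - k - 1:n - k - 1] for k in range(order)])
    y, w = signal[order:], weights[order:]
    return np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))

# Sanity check on a synthetic AR(2) signal:
# s[n] = 0.5*s[n-1] - 0.3*s[n-2] + e[n]
rng = np.random.default_rng(0)
s = np.zeros(4000)
e = 0.01 * rng.standard_normal(4000)
for i in range(2, 4000):
    s[i] = 0.5 * s[i - 1] - 0.3 * s[i - 2] + e[i]
a = wlp_coefficients(s, 2, np.ones(4000))  # recovers roughly [0.5, -0.3]
```

With uniform weights this reduces to ordinary covariance-method linear prediction; swapping in `gaussian_attenuation` weights down-weights the GCI neighborhoods as in GLP.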
Language fusion via adapters for low-resource speech recognition
IF 3.2 | CAS Tier 3 | Computer Science
Speech Communication Pub Date: 2024-03-01 DOI: 10.1016/j.specom.2024.103037
Qing Hu, Yan Zhang, Xianlei Zhang, Zongyu Han, Xiuxia Liang
Abstract: Data scarcity makes low-resource speech recognition systems suffer from severe overfitting. Although fine-tuning addresses this issue to some extent, it leads to parameter-inefficient training. In this paper, a novel language knowledge fusion method, named LanFusion, is proposed. It is built on the recently popular adapter-tuning technique and thus maintains better parameter efficiency than conventional fine-tuning methods. LanFusion is a two-stage method. Specifically, multiple adapters are first trained on several source languages to extract language-specific and language-invariant knowledge. The trained adapters are then re-trained on the target low-resource language to fuse the learned knowledge. Compared with Vanilla-adapter, LanFusion obtains relative average word error rate (WER) reductions of 9.8% and 8.6% on the Common Voice and FLEURS corpora, respectively. Extensive experiments demonstrate that the proposed method is not only simple and effective but also parameter-efficient. Besides, using source languages that are geographically similar to the target language yields better results on both datasets.
Citations: 0
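A bottleneck adapter of the kind adapter-tuning methods build on can be sketched as follows. Dimensions and initialization are illustrative, not LanFusion's; the key idea is that only these small matrices are trained while the backbone network stays frozen:

```python
import numpy as np

class Adapter:
    """Residual bottleneck adapter: down-project, ReLU, up-project,
    then add the input back. The residual path means an adapter with
    near-zero weights passes the backbone's hidden states through
    almost unchanged, which keeps early training stable."""
    def __init__(self, d_model=256, d_bottleneck=32, seed=0):
        rng = np.random.default_rng(seed)
        self.w_down = 0.02 * rng.standard_normal((d_model, d_bottleneck))
        self.w_up = 0.02 * rng.standard_normal((d_bottleneck, d_model))

    def __call__(self, h):
        z = np.maximum(h @ self.w_down, 0.0)  # bottleneck + ReLU
        return h + z @ self.w_up              # residual connection

h = np.ones((5, 256))                 # 5 frames of hidden states
out = Adapter()(h)
print(out.shape)                      # → (5, 256)
```

The parameter count is 2 * d_model * d_bottleneck per adapter (about 16k here), a small fraction of a transformer layer, which is what makes training one adapter per source language affordable.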
A distortionless convolution beamformer design method based on the weighted minimum mean square error for joint dereverberation and denoising
IF 3.2 | CAS Tier 3 | Computer Science
Speech Communication Pub Date: 2024-03-01 DOI: 10.1016/j.specom.2024.103054
Jing Zhou, Changchun Bao, Maoshen Jia, Wenmeng Xiong
Abstract: This paper designs a weighted minimum mean square error (WMMSE) based distortionless convolution beamformer (DCBF) for joint dereverberation and denoising. By applying the WMMSE criterion under a distortionless constraint, a DCBF is derived in which the outputs of the weighted prediction error (WPE) filter and the WPE-based minimum variance distortionless response (MVDR) beamformer are combined to initialize the target signal, balancing signal distortion, residual reverberation, and residual noise. In addition, two optimization factors are introduced to further reduce reverberation and noise when the initialized target signal is used in solving for the beamformer. As a result, the designed beamformer is presented as a linear combination of the WMMSE-based convolution beamformer (CBF) and the weighted power minimization distortionless response (WPD) filter. Experimental results demonstrate the superior performance of the designed beamformer for joint dereverberation and denoising compared to the reference methods.
Citations: 0
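The MVDR component used to initialize the target signal has a standard closed form; a minimal sketch per frequency bin, where the steering vector and covariance below are synthetic placeholders rather than quantities estimated from real microphone data:

```python
import numpy as np

def mvdr_weights(R, d):
    """MVDR beamformer weights for one frequency bin: minimize the
    output power w^H R w subject to the distortionless constraint
    w^H d = 1, giving w = R^{-1} d / (d^H R^{-1} d)."""
    ri_d = np.linalg.solve(R, d)
    return ri_d / (d.conj() @ ri_d)

rng = np.random.default_rng(0)
m = 4                                           # microphones
d = np.ones(m, dtype=complex)                   # toy steering vector
X = rng.standard_normal((m, 200)) + 1j * rng.standard_normal((m, 200))
R = X @ X.conj().T / 200 + 1e-3 * np.eye(m)     # sample covariance + loading
w = mvdr_weights(R, d)
print(abs(w.conj() @ d))                        # distortionless: |w^H d| = 1
```

In the paper's pipeline this MVDR output is combined with the WPE filter output; the sketch covers only the distortionless-response building block.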
Analysis of forced aligner performance on L2 English speech
IF 3.2 | CAS Tier 3 | Computer Science
Speech Communication Pub Date: 2024-03-01 DOI: 10.1016/j.specom.2024.103042
Samantha Williams, Paul Foulkes, Vincent Hughes
Abstract: There is growing interest in how speech technologies perform on L2 speech. Largely omitted from this discussion are tools used in the early data-processing steps, such as forced aligners, that can introduce errors and biases. This study adds to the conversation and tests how well a model pre-trained for the alignment of L1 American English speech performs on L2 English speech. We test and discuss the impact of language variety, demographic factors, and segment type on the performance of the forced aligner. We also examine the systematic errors encountered.

Forty-five speakers representing nine L2 varieties were selected from the Speech Accent Archive and force-aligned using the Montreal Forced Aligner. The phoneme-level boundary placements were manually corrected in order to assess differences between the automatic and manual alignments. Results show marked variation in performance across language groups and segment types for the two metrics used to assess accuracy: Onset Boundary Displacement, a distance metric between the automatic and manual boundary placements, and Overlap Rate, which indicates to what extent the automatically aligned segment overlaps with the manually aligned segment. The highest accuracy on both measures was obtained for German and French, and the lowest for Russian. The aligner's performance on all varieties was comparable to that on conversational American English and non-standard varieties of English. Furthermore, the percentage of boundary placements within 10 and 20 ms of the corrected boundary was similar to that observed between transcribers. Apart from errors due to variety mismatch, most alignment issues were due to problems not exclusive to L2 speech, such as inaccurate orthographic transcriptions, hesitations, specific voice qualities, and background noise.

The results of this study can inform the use of automatic aligners on L2 English speech and provide a baseline of potential errors, as well as information to support the development of more robust alignment tools for automatic systems using L2 English.
Citations: 0
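Both accuracy metrics can be computed directly from segment times. A sketch, where Overlap Rate is taken as shared duration divided by the span covered by either segment; the alignment literature uses several overlap definitions, so treat this particular one as an assumption rather than the paper's formula:

```python
def onset_displacement(auto_onset, manual_onset):
    """Onset Boundary Displacement: absolute distance (in seconds)
    between the automatic and manually corrected segment onsets."""
    return abs(auto_onset - manual_onset)

def overlap_rate(auto_seg, manual_seg):
    """Shared duration of two (start, end) segments divided by the
    duration of their union; 1.0 means the segments coincide."""
    (a0, a1), (m0, m1) = auto_seg, manual_seg
    shared = max(0.0, min(a1, m1) - max(a0, m0))
    union = max(a1, m1) - min(a0, m0)
    return shared / union if union > 0 else 0.0

print(round(onset_displacement(0.12, 0.10), 3))            # → 0.02 (20 ms)
print(round(overlap_rate((0.10, 0.30), (0.12, 0.32)), 3))  # → 0.818
```

Thresholding `onset_displacement` at 0.010 or 0.020 gives the within-10-ms and within-20-ms percentages the abstract compares against inter-transcriber agreement.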