Speech Communication: Latest Publications

The dependence of accommodation processes on conversational experience
IF 3.2 · CAS Tier 3, Computer Science
Speech Communication Pub Date: 2023-09-01 DOI: 10.1016/j.specom.2023.102963
L. Ann Burchfield, Mark Antoniou, Anne Cutler
{"title":"The dependence of accommodation processes on conversational experience","authors":"L. Ann Burchfield,&nbsp;Mark Antoniou,&nbsp;Anne Cutler","doi":"10.1016/j.specom.2023.102963","DOIUrl":"10.1016/j.specom.2023.102963","url":null,"abstract":"<div><p>Conversational partners accommodate to one another's speech, a process that greatly facilitates perception. This process occurs in both first (L1) and second languages (L2); however, recent research has revealed that adaptation can be language-specific, with listeners sometimes applying it in one language but not in another. Here, we investigate whether a supply of novel talkers impacts whether the adaptation is applied, testing Mandarin-English groups whose use of their two languages involves either an extensive or a restricted set of social situations. Perceptual learning in Mandarin and English is examined across two similarly-constituted groups in the same English-speaking environment: (a) heritage language users with Mandarin as family L1 and English as environmental language, and (b) international students with Mandarin as L1 and English as later-acquired L2. In English, exposure to an ambiguous sound in lexically disambiguating contexts prompted the expected retuning of phonemic boundaries in categorisation for the heritage users, but not for the students. In Mandarin, the opposite appeared: the heritage users showed no adaptation, but the students did adapt. In each case where learning did not appear, participants reported using the language in question with fewer interlocutors. The results support the view that successful retuning ability in any language requires regular conversational interaction with novel talkers.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":null,"pages":null},"PeriodicalIF":3.2,"publicationDate":"2023-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47047534","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Space-and-speaker-aware acoustic modeling with effective data augmentation for recognition of multi-array conversational speech
IF 3.2 · CAS Tier 3, Computer Science
Speech Communication Pub Date: 2023-09-01 DOI: 10.1016/j.specom.2023.102958
Li Chai, Hang Chen, Jun Du, Qing-Feng Liu, Chin-Hui Lee
{"title":"Space-and-speaker-aware acoustic modeling with effective data augmentation for recognition of multi-array conversational speech","authors":"Li Chai ,&nbsp;Hang Chen ,&nbsp;Jun Du ,&nbsp;Qing-Feng Liu ,&nbsp;Chin-Hui Lee","doi":"10.1016/j.specom.2023.102958","DOIUrl":"https://doi.org/10.1016/j.specom.2023.102958","url":null,"abstract":"<div><p>We propose a space-and-speaker-aware (SSA) approach to acoustic modeling (AM), denoted as SSA-AM, to improve system performances of automatic speech recognition (ASR) in distant multi-array conversational scenarios. In contrast to conventional AM which only uses spectral features from a target speaker as inputs, the inputs to SSA-AM consists of speech features from both the target and interfering speakers, which contain discriminative information from different speakers, including spatial information embedded in interaural phase differences (IPDs) between individual interfering speakers and the target speaker. In the proposed SSA-AM framework, we explore four acoustic model architectures consisting of different combinations of four neural networks, namely deep residual network, factorized time delay neural network, self-attention and residual bidirectional long short-term memory neural network. Various data augmentation techniques are adopted to expand the training data to include different options of beamformed speech obtained from multi-channel speech enhancement. Evaluated on the recent CHiME-6 Challenge Track 1, our proposed SSA-AM framework achieves consistent recognition performance improvements when compared with the official baseline acoustic models. Furthermore, SSA-AM outperforms acoustic models without explicitly using the space and speaker information. Finally, our data augmentation schemes are shown to be especially effective for compact model designs. Code is released at <span>https://github.com/coalboss/SSA_AM</span><svg><path></path></svg>.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":null,"pages":null},"PeriodicalIF":3.2,"publicationDate":"2023-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49728533","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Development of a hybrid word recognition system and dataset for the Azerbaijani Sign Language dactyl alphabet
IF 3.2 · CAS Tier 3, Computer Science
Speech Communication Pub Date: 2023-09-01 DOI: 10.1016/j.specom.2023.102960
Jamaladdin Hasanov, Nigar Alishzade, Aykhan Nazimzade, Samir Dadashzade, Toghrul Tahirov
{"title":"Development of a hybrid word recognition system and dataset for the Azerbaijani Sign Language dactyl alphabet","authors":"Jamaladdin Hasanov ,&nbsp;Nigar Alishzade ,&nbsp;Aykhan Nazimzade ,&nbsp;Samir Dadashzade ,&nbsp;Toghrul Tahirov","doi":"10.1016/j.specom.2023.102960","DOIUrl":"10.1016/j.specom.2023.102960","url":null,"abstract":"<div><p>The paper introduces a real-time fingerspelling-to-text translation system for the Azerbaijani Sign Language (AzSL), targeted to the clarification of the words with no available or ambiguous signs. The system consists of both statistical and probabilistic models, used in the sign recognition and sequence generation phases. Linguistic, technical, and <em>human–computer interaction</em>-related challenges, which are usually not considered in publicly available sign-based recognition application programming interfaces and tools, are addressed in this study. The specifics of the AzSL are reviewed, feature selection strategies are evaluated, and a robust model for the translation of hand signs is suggested. The two-stage recognition model exhibits high accuracy during real-time inference. Considering the lack of a publicly available dataset with the benchmark, a new, comprehensive AzSL dataset consisting of 13,444 samples collected by 221 volunteers is described and made publicly available for the sign language recognition community. To extend the dataset and make the model robust to changes, augmentation methods and their effect on the performance are analyzed. A lexicon-based validation method used for the probabilistic analysis and candidate word selection enhances the probability of the recognized phrases. Experiments delivered 94% accuracy on the test dataset, which was close to the real-time user experience. The dataset and implemented software are shared in a public repository for review and further research (CeDAR, 2021; Alishzade et al., 2022). The work has been presented at TeknoFest 2022 and ranked as the first in the category of <em>social-oriented technologies</em>.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":null,"pages":null},"PeriodicalIF":3.2,"publicationDate":"2023-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46498442","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
A new time–frequency representation based on the tight framelet packet for telephone-band speech coding
IF 3.2 · CAS Tier 3, Computer Science
Speech Communication Pub Date: 2023-07-01 DOI: 10.1016/j.specom.2023.102954
Souhir Bousselmi, Kaïs Ouni
{"title":"A new time–frequency representation based on the tight framelet packet for telephone-band speech coding","authors":"Souhir Bousselmi,&nbsp;Kaïs Ouni","doi":"10.1016/j.specom.2023.102954","DOIUrl":"10.1016/j.specom.2023.102954","url":null,"abstract":"<div><p>To improve the quality and intelligibility of telephone-band speech coding, a new time–frequency representation based on a tight framelet packet transform is proposed in this paper. In the context of speech coding, the effectiveness of this representation stems from its resilience to quantization noise, and reconstruction stability. Moreover, it offers a sub-band decomposition and good time–frequency localization according to the critical bands of the human ear. The coded signal is obtained using dynamic bit allocation and optimal quantization of normalized framelet coefficients. The performances of the corresponding method are compared to the critically sampled wavelet packet transform. Extensive simulation revealed that the proposed speech coding scheme, which incorporates the tight framelet packet transform performs better than that based on the critically sampled wavelet packet transform. Furthermore, it ensures a high bit-rate reduction with negligible degradation in speech quality. The proposed coder is found to outperform the standard telephone-band speech coders in term of objective measures and subjective evaluations including a formal listening test. The subjective quality of our codec at 4 kbps is almost identical to the reference G.711 codec operating at 64 kbps.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":null,"pages":null},"PeriodicalIF":3.2,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46455896","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Application of virtual human sign language translation based on speech recognition
IF 3.2 · CAS Tier 3, Computer Science
Speech Communication Pub Date: 2023-07-01 DOI: 10.1016/j.specom.2023.06.001
Xin Li, Shuying Yang, Haiming Guo
{"title":"Application of virtual human sign language translation based on speech recognition","authors":"Xin Li ,&nbsp;Shuying Yang,&nbsp;Haiming Guo","doi":"10.1016/j.specom.2023.06.001","DOIUrl":"10.1016/j.specom.2023.06.001","url":null,"abstract":"<div><p>For the application problem of speech recognition to sign language translation, we conducted a study in two parts: improving speech recognition's effectiveness and promoting the application of sign language translation. The mainstream frequency-domain feature has achieved great success in speech recognition. However, it fails to capture the instantaneous gap in speech, and the time-domain feature makes up for this deficiency. In order to combine the advantages of frequency and time domain features, an acoustic architecture with a joint time domain encoder and frequency domain encoder is proposed. A new time-domain feature based on SSM (State-Space-Model) is proposed in the time- domain encoder and encoded using the GRU model. A new model, ConFLASH, is proposed in the frequency domain encoder, which is a lightweight model combining CNN and FLASH (a variant of the Transformer model). It not only reduces the computational complexity of the Transformer model but also effectively integrates the global modeling advantages of the Transformer model and the local modeling advantages of CNN. The Transducer structure is used to decode speech after the encoders are joined. This acoustic model is named GRU-ConFLASH- Transducer. On the self-built dataset and open-source dataset speechocean, it achieves optimal WER (Word Error Rate) of 2.6% and 4.7%. In addition, to better realize the visual application of sign language translation, a 3D virtual human model is designed and developed.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":null,"pages":null},"PeriodicalIF":3.2,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48223785","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Real-time intelligibility affects the realization of French word-final schwa
IF 3.2 · CAS Tier 3, Computer Science
Speech Communication Pub Date: 2023-07-01 DOI: 10.1016/j.specom.2023.102962
Georgia Zellou, Ioana Chitoran, Ziqi Zhou
{"title":"Real-time intelligibility affects the realization of French word-final schwa","authors":"Georgia Zellou ,&nbsp;Ioana Chitoran ,&nbsp;Ziqi Zhou","doi":"10.1016/j.specom.2023.102962","DOIUrl":"10.1016/j.specom.2023.102962","url":null,"abstract":"<div><p>Speech variation has been hypothesized to reflect both speaker-internal influences of lexical access on production and adaptive modifications to make words more intelligible to the listener. The current study considers categorical and gradient variation in the production of word-final schwa in French as explained by lexical access processes, phonological, and/or listener-oriented influences on speech production, while controlling for other factors. To that end, native French speakers completed two laboratory production tasks. In Experiment 1, speakers produced 32 monosyllabic words varying in lexical frequency in a word list production task with no listener feedback. In Experiment 2, speakers produced the same words to an interlocutor while completing a map task varying listener comprehension success across trials: in half the trials, the words are correctly perceived by the interlocutor; in half, there is misunderstanding. Results reveal that speakers are more likely to produce word-final schwa when there is explicit pressure to be intelligible to the interlocutor. Also, when schwa is produced, it is longer preceding a consonant-initial word. Taken together, findings suggest that there are both phonological and clarity-oriented influences on word-final schwa realization in French.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":null,"pages":null},"PeriodicalIF":3.2,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45132784","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Addressing the semi-open set dialect recognition problem under resource-efficient considerations
IF 3.2 · CAS Tier 3, Computer Science
Speech Communication Pub Date: 2023-07-01 DOI: 10.1016/j.specom.2023.102957
Spandan Dey, Goutam Saha
{"title":"Addressing the semi-open set dialect recognition problem under resource-efficient considerations","authors":"Spandan Dey,&nbsp;Goutam Saha","doi":"10.1016/j.specom.2023.102957","DOIUrl":"10.1016/j.specom.2023.102957","url":null,"abstract":"<div><p>This work presents a resource-efficient solution for the spoken dialect recognition task under semi-open set evaluation scenarios, where a closed set model is exposed to unknown class inputs. We have primarily explored the task 2 of the OLR 2020 challenge for our experiments. In this task, three Chinese dialects Hokkien, Sichuanese, and Shanghainese, are to be recognized. For evaluation, along with the three target dialects, utterances from other unknown classes are also included. We find that the top-performing submissions and the baseline system did not propose solutions that explicitly address the semi-open set scenario. This work pays special attention to the semi-open set nature of the problem and analyzes how the unknown utterances can potentially degrade the overall performance if not treated separately. We train our main dialect classifier with the ECAPA-TDNN architecture and 40-dimensional MFCC from the training data of three dialects. We propose a confidence-assessment algorithm and combine the TDNN performance from both end-to-end and embedding extractor approaches. We then frame the semi-open set scenario as a constrained optimization problem. By solving it, we prove that the performance degradation by the unknown utterances is minimized if the corresponding softmax prediction is equally confused among the target outputs. Based on this criterion, we develop different feedback modules in our system. These modules work on the novelty detection principles and flag unknown class utterances as anomaly. The prediction score of the corresponding utterance is then penalized by flattening. The proposed system achieves <span><math><mrow><msub><mrow><mi>C</mi></mrow><mrow><mi>avg</mi></mrow></msub><mrow><mo>(</mo><mo>×</mo><mn>100</mn><mo>)</mo></mrow></mrow></math></span> score of 8.50 and EER <span><math><mrow><mo>(</mo><mtext>%</mtext><mo>)</mo></mrow></math></span> of 9.77. Averaging both metrics, the score for our system outperforms the winning submission. Due to the proposed semi-open set adaptations, our system achieves this performance using much less training data and computation resources than the top-performing submissions. Additionally, to verify the broader applicability of the proposed semi-open set solution, we experiment with two other dialect recognition tasks covering English and Arabic languages and larger database sizes.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":null,"pages":null},"PeriodicalIF":3.2,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47720119","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Using iterative adaptation and dynamic mask for child speech extraction under real-world multilingual conditions
IF 3.2 · CAS Tier 3, Computer Science
Speech Communication Pub Date: 2023-07-01 DOI: 10.1016/j.specom.2023.102956
Shi Cheng, Jun Du, Shutong Niu, Alejandrina Cristia, Xin Wang, Qing Wang, Chin-Hui Lee
{"title":"Using iterative adaptation and dynamic mask for child speech extraction under real-world multilingual conditions","authors":"Shi Cheng ,&nbsp;Jun Du ,&nbsp;Shutong Niu ,&nbsp;Alejandrina Cristia ,&nbsp;Xin Wang ,&nbsp;Qing Wang ,&nbsp;Chin-Hui Lee","doi":"10.1016/j.specom.2023.102956","DOIUrl":"10.1016/j.specom.2023.102956","url":null,"abstract":"<div><p>We develop two improvements over our previously-proposed joint enhancement and separation (JES) framework for child speech extraction in real-world multilingual scenarios. First, we introduce an iterative adaptation based separation (IAS) technique to iteratively fine-tune our pre-trained separation model in JES using data from real scenes to adapt the model. Second, to purify the training data, we propose a dynamic mask separation (DMS) technique with variable lengths in movable windows to locate meaningful speech segments using a scale-invariant signal-to-noise ratio (SI-SNR) objective. With DMS on top of IAS, called DMS+IAS, the combined technique can remove a large number of noise backgrounds and correctly locate speech regions in utterances recorded under real-world scenarios. Evaluated on the BabyTrain corpus, our proposed IAS system achieves consistent extraction performance improvements when compared to our previously-proposed JES framework. Moreover, experimental results also show that the proposed DMS+IAS technique can further improve the quality of separated child speech in real-world scenarios and obtain a relatively good extraction performance in difficult situations where adult speech is mixed with child speech.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":null,"pages":null},"PeriodicalIF":3.2,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41459371","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Fusion-based speech emotion classification using two-stage feature selection
IF 3.2 · CAS Tier 3, Computer Science
Speech Communication Pub Date: 2023-07-01 DOI: 10.1016/j.specom.2023.102955
Jie Xie, Mingying Zhu, Kai Hu
{"title":"Fusion-based speech emotion classification using two-stage feature selection","authors":"Jie Xie ,&nbsp;Mingying Zhu ,&nbsp;Kai Hu","doi":"10.1016/j.specom.2023.102955","DOIUrl":"10.1016/j.specom.2023.102955","url":null,"abstract":"<div><p>Speech emotion recognition plays an important role in human–computer interaction, which uses speech signals to determine the emotional state. Previous studies have proposed various features and feature selection methods. However, few studies have investigated the two-stage feature selection method for speech emotion classification. In this study, we propose a novel speech emotion classification algorithm based on two-stage feature selection and two fusion strategies. Specifically, three types of features are extracted from speech signals: constant-Q spectrogram-based histogram of oriented gradients, openSMILE, and wavelet packet decomposition-based features. Then, two-stage feature selection using random forest and grey wolf optimization is applied to reduce feature dimension and model training time and improve the classification performance. In addition, both early and late fusion strategies are explored aiming to further improve the performance. Experimental results indicate that early fusion with two-stage feature selection can achieve the best performance. The highest classification accuracy for RAVDESS, SAVEE, EMOVO, and EmoDB is 86.97%, 88.79%, 89.24%, and 95.29%, respectively.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":null,"pages":null},"PeriodicalIF":3.2,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42570284","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Multiple voice disorders in the same individual: Investigating handcrafted features, multi-label classification algorithms, and base-learners
IF 3.2 · CAS Tier 3, Computer Science
Speech Communication Pub Date: 2023-07-01 DOI: 10.1016/j.specom.2023.102952
Sylvio Barbon Junior, Rodrigo Capobianco Guido, Gabriel Jonas Aguiar, Everton José Santana, Mario Lemes Proença Junior, Hemant A. Patil
{"title":"Multiple voice disorders in the same individual: Investigating handcrafted features, multi-label classification algorithms, and base-learners","authors":"Sylvio Barbon Junior ,&nbsp;Rodrigo Capobianco Guido ,&nbsp;Gabriel Jonas Aguiar ,&nbsp;Everton José Santana ,&nbsp;Mario Lemes Proença Junior ,&nbsp;Hemant A. Patil","doi":"10.1016/j.specom.2023.102952","DOIUrl":"10.1016/j.specom.2023.102952","url":null,"abstract":"<div><p>Non-invasive acoustic analyses of voice disorders have been at the forefront of current biomedical research. Usual strategies, essentially based on machine learning (ML) algorithms, commonly classify a subject as being either healthy or pathologically-affected. Nevertheless, the latter state is not always a result of a sole laryngeal issue, i.e., multiple disorders might exist, demanding multi-label classification procedures for effective diagnoses. Consequently, the objective of this paper is to investigate the application of five multi-label classification methods based on problem transformation to play the role of base-learners, i.e., Label Powerset, Binary Relevance, Nested Stacking, Classifier Chains, and Dependent Binary Relevance with Random Forest (RF) and Support Vector Machine (SVM), in addition to a Deep Neural Network (DNN) from an algorithm adaptation method, to detect multiple voice disorders, i.e., Dysphonia, Laryngitis, Reinke’s Edema, Vox Senilis, and Central Laryngeal Motion Disorder. Receiving as input three handcrafted features, i.e., signal energy (SE), zero-crossing rates (ZCRs), and signal entropy (SH), which allow for interpretable descriptors in terms of speech analysis, production, and perception, we observed that the DNN-based approach powered with SE-based feature vectors presented the best values of F1-score among the tested methods, i.e., 0.943, as the averaged value from all the balancing scenarios, under Saarbrücken Voice Database (SVD) and considering 20% of balancing rate with Synthetic Minority Over-sampling Technique (SMOTE). Finally, our findings of most false negatives for laryngitis may explain the reason why its detection is a serious issue in speech technology. The results we report provide an original contribution, allowing for the consistent detection of multiple speech pathologies and advancing the state-of-the-art in the field of handcrafted acoustic-based non-invasive diagnosis of voice disorders.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":null,"pages":null},"PeriodicalIF":3.2,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44587953","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0