Computing nasalance with MFCCs and Convolutional Neural Networks.

IF 2.6 | Region 3 (multidisciplinary journals) | JCR Q1 MULTIDISCIPLINARY SCIENCES

PLoS ONE | Pub Date: 2024-12-31 | eCollection Date: 2024-01-01 | DOI: 10.1371/journal.pone.0315452

Andrés Lozano, Enrique Nava, María Dolores García Méndez, Ignacio Moreno-Torres

PLoS ONE 19(12): e0315452. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11687758/pdf/

Citations: 0

Abstract



Nasalance is a valuable clinical biomarker for hypernasality. It is computed as the ratio of acoustic energy emitted through the nose to the total energy emitted through the mouth and nose (eNasalance). A new approach is proposed to compute nasalance using Convolutional Neural Networks (CNNs) trained with Mel-Frequency Cepstral Coefficients (MFCCs); the resulting measure is called mfccNasalance. mfccNasalance is evaluated by examining its accuracy: 1) when the training and test data are from the same or from different dialects; 2) with test data that differ in dynamicity (e.g. rapidly produced diadochokinetic syllables versus short words); and 3) using multiple CNN configurations (i.e. kernel shape and use of 1 × 1 pointwise convolution). Dual-channel Nasometer speech data were recorded from healthy speakers of different dialects: Costa Rica (more nasal, +nasal) and Spain and Chile (less nasal, -nasal). The inputs to the CNN models were sequences of 39 MFCC vectors computed from 250 ms moving windows. The test data were recorded in Spain and included short words (-dynamic), sentences (+dynamic), and diadochokinetic syllables (+dynamic). The accuracy of a CNN model was defined as the Spearman correlation between the mfccNasalance for that model and the perceptual nasality scores of human experts. In the same-dialect condition, mfccNasalance was more accurate than eNasalance independently of the CNN configuration; using a 1 × 1 kernel resulted in increased accuracy for +dynamic utterances (p < .000), though not for -dynamic utterances. The kernel shape had a significant impact for -dynamic utterances only (p < .000). In the different-dialect condition, the scores were significantly less accurate than in the same-dialect condition, particularly for Costa Rica-trained models. We conclude that mfccNasalance is a flexible and useful alternative to eNasalance. Future studies should explore how to optimize mfccNasalance by selecting the most adequate CNN model as a function of the dynamicity of the target speech data.
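The two quantities the abstract defines in words, the energy-ratio nasalance (eNasalance) and the Spearman correlation used as the accuracy measure, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, energy is taken as the plain sum of squared samples (a real Nasometer applies its own filtering), frames are non-overlapping, and the sample values are made up.

```python
from math import sqrt

def energy(samples):
    """Sum of squared amplitudes (assumed energy definition)."""
    return sum(x * x for x in samples)

def e_nasalance(nasal, oral):
    """Nasal energy divided by total (nasal + oral) energy."""
    e_n, e_o = energy(nasal), energy(oral)
    total = e_n + e_o
    return e_n / total if total > 0 else 0.0

def frame_nasalance(nasal, oral, frame_len):
    """eNasalance per non-overlapping frame of frame_len samples."""
    n = min(len(nasal), len(oral))
    return [e_nasalance(nasal[i:i + frame_len], oral[i:i + frame_len])
            for i in range(0, n - frame_len + 1, frame_len)]

def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)

# Made-up two-channel frame: the nasal channel has half the oral amplitude.
nasal = [0.5, -0.5, 0.5, -0.5]
oral = [1.0, -1.0, 1.0, -1.0]
score = e_nasalance(nasal, oral)  # 1.0 / (1.0 + 4.0)

# Made-up model scores and perceptual ratings with identical ordering.
model_scores = [0.1, 0.4, 0.3, 0.9]
expert_ratings = [1, 3, 2, 4]
rho = spearman(model_scores, expert_ratings)
```

Because Spearman correlation depends only on rank order, a model's scores do not need to be on the same scale as the perceptual ratings to be compared with them.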


Source journal: PLoS ONE (Biology)

CiteScore: 6.20

Self-citation rate: 5.40%

Articles published: 14242

Review time: 3.7 months

About the journal: PLOS ONE is an international, peer-reviewed, open-access, online publication. PLOS ONE welcomes reports on primary research from any scientific discipline. It provides:
* Open access: freely accessible online, authors retain copyright
* Fast publication times
* Peer review by expert, practicing researchers
* Post-publication tools to indicate quality and impact
* Community-based dialogue on articles
* Worldwide media coverage