Automatic cross- and multi-lingual recognition of dysphonia by ensemble classification using deep speaker embedding models

IF 3.0 · CAS Quartile 4 (Computer Science) · JCR Q2, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Expert Systems Pub Date : 2024-06-12 DOI:10.1111/exsy.13660
Dosti Aziz, Dávid Sztahó
Open-access PDF: https://onlinelibrary.wiley.com/doi/epdf/10.1111/exsy.13660
Citations: 0

Abstract


Automatic cross- and multi-lingual recognition of dysphonia by ensemble classification using deep speaker embedding models


Machine Learning (ML) algorithms have demonstrated remarkable performance in dysphonia detection using speech samples. However, their efficacy often diminishes when tested on languages different from the training data, raising questions about their suitability in clinical settings. This study aims to develop a robust method for cross- and multi-lingual dysphonia detection that overcomes the limitation of language dependency in existing ML methods. We propose an innovative approach that leverages speech embeddings from speaker verification models, especially ECAPA and x-vector and employs a majority voting ensemble classifier. We utilize speech features extracted from ECAPA and x-vector embeddings to train three distinct classifiers. The significant advantage of these embedding models lies in their capability to capture speaker characteristics in a language-independent manner, forming fixed-dimensional feature spaces. Additionally, we investigate the impact of generating synthetic data within the embedding feature space using the Synthetic Minority Oversampling Technique (SMOTE). Our experimental results unveil the effectiveness of the proposed method for dysphonia detection. Compared to results obtained from x-vector embeddings, ECAPA consistently demonstrates superior performance in distinguishing between healthy and dysphonic speech, achieving accuracy values of 93.33% and 96.55% in both cross-lingual and multi-lingual scenarios, respectively. This highlights the remarkable capabilities of speaker verification models, especially ECAPA, in capturing language-independent features that enhance overall detection performance. The proposed method effectively addresses the challenges of language dependency in dysphonia detection. ECAPA embeddings, combined with majority voting ensemble classifiers, show significant potential for improving the accuracy and reliability of dysphonia detection in cross- and multi-lingual scenarios.
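The pipeline the abstract describes (fixed-dimensional speaker embeddings, SMOTE-style class balancing in the embedding space, three classifiers combined by majority voting) can be sketched roughly as follows. This is an illustrative sketch only: the toy 4-dimensional vectors stand in for real ECAPA/x-vector embeddings, and the nearest-centroid classifiers and all function names are assumptions for illustration, not the paper's actual models.

```python
import random
from collections import Counter

def smote_like(minority, n_new, rng):
    """Minimal SMOTE-style oversampling: create synthetic minority-class
    points by linear interpolation between random pairs (the real SMOTE
    interpolates towards k nearest neighbours)."""
    new = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)
        t = rng.random()
        new.append([ai + t * (bi - ai) for ai, bi in zip(a, b)])
    return new

def centroid(vectors):
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def train_centroid_classifier(X, y):
    """Trivial nearest-centroid classifier standing in for one of the
    three distinct classifiers trained on the embeddings."""
    c0 = centroid([x for x, lab in zip(X, y) if lab == 0])
    c1 = centroid([x for x, lab in zip(X, y) if lab == 1])
    dist = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return lambda x: 0 if dist(x, c0) <= dist(x, c1) else 1

def majority_vote(classifiers, x):
    """Hard-voting ensemble: each classifier casts one vote."""
    return Counter(clf(x) for clf in classifiers).most_common(1)[0][0]

# Toy "embeddings": label 0 = healthy (near 0.0), label 1 = dysphonic (near 1.0).
rng = random.Random(0)
healthy = [[rng.gauss(0.0, 0.1) for _ in range(4)] for _ in range(20)]
dysphonic = [[rng.gauss(1.0, 0.1) for _ in range(4)] for _ in range(6)]
dysphonic += smote_like(dysphonic, 14, rng)  # balance the minority class
X = healthy + dysphonic
y = [0] * 20 + [1] * 20

# Train three classifiers on bootstrap resamples; combine by majority vote.
clfs = []
for seed in range(3):
    r = random.Random(seed)
    idx = [r.randrange(len(X)) for _ in range(len(X))]
    clfs.append(train_centroid_classifier([X[i] for i in idx], [y[i] for i in idx]))

print(majority_vote(clfs, [0.95, 1.05, 0.98, 1.02]))  # a dysphonic-like embedding
```

In the paper, the embeddings would come from pretrained speaker-verification networks and the three base classifiers would be genuinely distinct learners; the hard-voting combination step is the part shown faithfully here.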

Source journal: Expert Systems (Engineering & Technology; Computer Science: Theory & Methods)
CiteScore: 7.40
Self-citation rate: 6.10%
Annual article count: 266
Review time: 24 months
Journal description: Expert Systems: The Journal of Knowledge Engineering publishes papers dealing with all aspects of knowledge engineering, including individual methods and techniques in knowledge acquisition and representation, and their application in the construction of systems – including expert systems – based thereon. Detailed scientific evaluation is an essential part of any paper. As well as traditional application areas, such as Software and Requirements Engineering, Human-Computer Interaction, and Artificial Intelligence, we are aiming at the new and growing markets for these technologies, such as Business, Economy, Market Research, and Medical and Health Care. The shift towards this new focus will be marked by a series of special issues covering hot and emergent topics.