A Novel Fusion Architecture for PD Detection Using Semi-Supervised Speech Embeddings

arXiv - CS - Sound | Pub Date: 2024-05-21 | DOI: arxiv-2405.17206

Tariq Adnan, Abdelrahman Abdelkader, Zipei Liu, Ekram Hossain, Sooyong Park, MD Saiful Islam, Ehsan Hoque



Abstract

We present a framework for detecting Parkinson's disease (PD) from speech recordings of an English pangram utterance, collected through a web application across diverse recording settings and environments, including participants' homes. Our dataset comprises a global cohort of 1306 participants, 392 of whom have been diagnosed with PD. Leveraging the diversity of the dataset, which spans demographic properties such as age, sex, and ethnicity, we used deep learning embeddings derived from semi-supervised models such as Wav2Vec 2.0, WavLM, and ImageBind to represent the speech dynamics associated with PD. Our novel fusion model for PD classification, which aligns different speech embeddings into a cohesive feature space, outperformed standard concatenation-based fusion models and other baselines (including models built on traditional acoustic features). In a randomized data split configuration, the model achieved an Area Under the Receiver Operating Characteristic Curve (AUROC) of 88.94% and an accuracy of 85.65%. Rigorous statistical analysis confirmed that the model performs equitably across demographic subgroups defined by sex, ethnicity, and age, and remains robust regardless of disease duration. Furthermore, when tested on two entirely unseen test datasets, one collected from clinical settings and one from a PD care center, the model maintained AUROC scores of 82.12% and 78.44%, respectively. This affirms the model's robustness and its potential to enhance accessibility and health equity in real-world applications.
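The abstract contrasts concatenation-based fusion with a model that aligns embeddings into a shared feature space, but it does not describe the architecture itself. The following is only a minimal illustrative sketch of that contrast, not the authors' method: the function names, the embedding sizes, the shared dimension of 128, and the random projection matrices (which would be learned in a real model) are all assumptions. It also includes a small rank-based AUROC helper to show what the reported metric measures.

```python
import numpy as np


def concat_fusion(embeddings):
    """Baseline fusion: concatenate per-model embeddings along the feature axis."""
    return np.concatenate(embeddings, axis=1)


def projection_fusion(embeddings, weights):
    """Hypothetical alignment-style fusion.

    Each embedding matrix (n_samples, d_i) is projected into a shared space with
    its own matrix weights[i] of shape (d_i, d_shared), L2-normalised per sample,
    and then averaged. In a trained model the projections would be learned.
    """
    projected = [x @ w for x, w in zip(embeddings, weights)]
    # Normalise so that no single source dominates the average by scale alone.
    projected = [p / np.linalg.norm(p, axis=1, keepdims=True) for p in projected]
    return np.mean(projected, axis=0)


def auroc(labels, scores):
    """AUROC as the probability that a random positive outscores a random negative.

    Ties count as half. Labels are 1 for PD and 0 for non-PD.
    """
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (pos.size * neg.size))


# Synthetic stand-ins for per-recording embeddings (placeholder sizes, not from the paper).
rng = np.random.default_rng(0)
n_samples, d_shared = 8, 128
dims = {"wav2vec2": 768, "wavlm": 768, "imagebind": 1024}

embeddings = [rng.normal(size=(n_samples, d)) for d in dims.values()]
weights = [rng.normal(size=(d, d_shared)) for d in dims.values()]

fused_concat = concat_fusion(embeddings)               # one wide vector per sample
fused_shared = projection_fusion(embeddings, weights)  # one compact vector per sample

print(fused_concat.shape, fused_shared.shape)
print(auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

In this sketch the concatenated representation grows with every added embedding source, whereas the projected representation keeps a fixed size; either output could then be passed to a downstream classifier whose scores are evaluated with `auroc`.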


