Multimodal machine learning for language and speech markers identification in mental health.

IF 3.3, Region 3 (Medicine), JCR Q2 (MEDICAL INFORMATICS)

BMC Medical Informatics and Decision Making, Pub Date: 2024-11-22, DOI: 10.1186/s12911-024-02772-0

Georgios Drougkas, Erwin M Bakker, Marco Spruit

Volume 24, issue 1, article 354. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11583567/pdf/

Citations: 0

Abstract


Multimodal machine learning for language and speech markers identification in mental health.

Background: Numerous papers focus on diagnosing mental health disorders using unimodal and multimodal approaches. However, our literature review shows that most of these studies either use unimodal approaches to diagnose a variety of mental disorders or employ multimodal approaches to diagnose a single mental disorder. In this research we combine these approaches: we first identify and compile an extensive list of mental health disorder markers, covering a wide range of mental illnesses, that have been used in both unimodal and multimodal methods, and then use this list to determine whether a multimodal approach can outperform the unimodal approaches.

Methods: For this study we used the well-known and robust multimodal DAIC-WOZ dataset, which is derived from clinical interviews, focusing on the text and audio modalities. First, we constructed two unimodal models to analyze text and audio data, respectively, using feature extraction based on the extensive list of mental disorder markers that we compiled from related and earlier studies. For the unimodal text model, we also propose an initial, pragmatic binary label creation process. Then, we employed an early fusion strategy to combine the text and audio features before model processing. The fused feature set was given as input to various baseline machine learning and deep learning algorithms, including Support Vector Machines, Logistic Regression, Random Forests, and fully connected neural network classifiers (Dense Layers). Finally, model performance was evaluated using accuracy, AUC-ROC score, and two F1 metrics: one for the prediction of positive cases and one for the prediction of negative cases.
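The early fusion step described in the Methods can be illustrated with a minimal sketch. This is not the authors' pipeline: the feature values, the z-score scaling, the function names (`zscore_columns`, `early_fusion`, `train_logistic_regression`, `predict`), and the hand-written logistic regression are all assumptions chosen to keep the example self-contained. It only shows the general idea of concatenating per-participant text and audio feature vectors before a single classifier sees them.

```python
import math

def zscore_columns(rows):
    """Scale each feature column to zero mean and unit variance.
    Simplification: statistics come from all rows; a real experiment
    would fit them on the training split only."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / n) or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

def early_fusion(text_rows, audio_rows):
    """Early fusion: concatenate the text and audio features of each
    participant into one vector before any model is trained."""
    if len(text_rows) != len(audio_rows):
        raise ValueError("both modalities must cover the same participants")
    text_z = zscore_columns(text_rows)
    audio_z = zscore_columns(audio_rows)
    return [t + a for t, a in zip(text_z, audio_z)]

def train_logistic_regression(X, y, lr=0.5, epochs=200):
    """Batch gradient descent on the logistic loss; returns (weights, bias)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            score = sum(wj * xj for wj, xj in zip(w, xi)) + b
            err = 1.0 / (1.0 + math.exp(-score)) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(X, w, b):
    """Label 1 (marker present) when the linear score is positive."""
    return [1 if sum(wj * xj for wj, xj in zip(w, xi)) + b > 0 else 0
            for xi in X]

# Hypothetical toy features for four participants (not DAIC-WOZ data).
text_rows = [[5.0, 1.0], [4.0, 2.0], [1.0, 0.0], [0.0, 1.0]]   # e.g. word counts
audio_rows = [[0.2], [0.3], [0.8], [0.9]]                      # e.g. one acoustic statistic
labels = [1, 1, 0, 0]

fused = early_fusion(text_rows, audio_rows)   # 4 rows, 2 + 1 = 3 features each
weights, bias = train_logistic_regression(fused, labels)
print(predict(fused, weights, bias))
```

The same fused matrix could be passed to any of the baseline classifiers named in the abstract; logistic regression is used here only because it is short enough to write out in full.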

Results: Overall, the unimodal text models achieved an accuracy ranging from 78% to 87% and an AUC-ROC score between 85% and 93%, while the unimodal audio models attained an accuracy of 64% to 72% and AUC-ROC scores of 53% to 75%. The experimental results indicated that our multimodal models achieved accuracy (80% to 87%) and AUC-ROC scores (84% to 93%) comparable to those of the unimodal text models. However, the majority of the multimodal models outperformed the unimodal models in F1 scores, particularly in the F1 score of the positive class (F1 of 1s), which reflects how well the models identify the presence of a marker.
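The distinction between accuracy and the two per-class F1 scores can be made concrete with a small worked example. The labels and predictions below are invented for illustration and are not results from the paper; `accuracy` and `f1_for_class` are hypothetical helper names. The sketch shows that two models with identical accuracy can differ in the positive-class F1, the pattern reported above for the multimodal versus unimodal models.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_for_class(y_true, y_pred, cls):
    """F1 score treating `cls` as the class of interest:
    F1 = 2*TP / (2*TP + FP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)

# Invented, imbalanced toy labels: 3 positives (marker present), 7 negatives.
y_true  = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
model_a = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]   # conservative: misses two positives
model_b = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]   # finds more positives, one false alarm

for name, pred in (("A", model_a), ("B", model_b)):
    print(name,
          f"accuracy={accuracy(y_true, pred):.3f}",
          f"F1(1s)={f1_for_class(y_true, pred, 1):.3f}",
          f"F1(0s)={f1_for_class(y_true, pred, 0):.3f}")
# A accuracy=0.800 F1(1s)=0.500 F1(0s)=0.875
# B accuracy=0.800 F1(1s)=0.667 F1(0s)=0.857
```

Both toy models are right on 8 of 10 cases, yet model B identifies the positive class noticeably better, which accuracy alone would hide.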

Conclusions: We argue that, by refining the binary label creation process and improving the feature engineering of the unimodal acoustic model, the multimodal model can outperform both unimodal approaches. This study underscores the importance of multimodal integration in the field of mental health diagnostics and sets the stage for future research to explore more sophisticated fusion techniques and deeper learning models.


Source journal

BMC Medical Informatics and Decision Making (Medicine: Medical Informatics)

CiteScore: 7.20

Self-citation rate: 5.70%

Articles published: 297

Review time: 1 month

About the journal: BMC Medical Informatics and Decision Making is an open access journal publishing original peer-reviewed research articles in relation to the design, development, implementation, use, and evaluation of health information technologies and decision-making for human health.