Eoin McElroy, Thomas Wood, Raymond Bond, Maurice Mulvenna, Mark Shevlin, George B Ploubidis, Mauricio Scopel Hoffmann, Bettina Moltrecht
{"title":"Using natural language processing to facilitate the harmonisation of mental health questionnaires: a validation study using real-world data.","authors":"Eoin McElroy, Thomas Wood, Raymond Bond, Maurice Mulvenna, Mark Shevlin, George B Ploubidis, Mauricio Scopel Hoffmann, Bettina Moltrecht","doi":"10.1186/s12888-024-05954-2","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Pooling data from different sources will advance mental health research by providing larger sample sizes and allowing cross-study comparisons; however, the heterogeneity in how variables are measured across studies poses a challenge to this process.</p><p><strong>Methods: </strong>This study explored the potential of using natural language processing (NLP) to harmonise different mental health questionnaires by matching individual questions based on their semantic content. Using the Sentence-BERT model, we calculated the semantic similarity (cosine index) between 741 pairs of questions from five questionnaires. Drawing on data from a representative UK sample of adults (N = 2,058), we calculated a Spearman rank correlation for each of the same pairs of items, and then estimated the correlation between the cosine values and Spearman coefficients. We also used network analysis to explore the model's ability to uncover structures within the data and metadata.</p><p><strong>Results: </strong>We found a moderate overall correlation (r = .48, p < .001) between the two indices. In a holdout sample, the cosine scores predicted the real-world correlations with a small degree of error (MAE = 0.05, MedAE = 0.04, RMSE = 0.064) suggesting the utility of NLP in identifying similar items for cross-study data pooling. 
Our NLP model could detect more complex patterns in our data, however it required manual rules to decide which edges to include in the network.</p><p><strong>Conclusions: </strong>This research shows that it is possible to quantify the semantic similarity between pairs of questionnaire items from their meta-data, and these similarity indices correlate with how participants would answer the same two items. This highlights the potential of NLP to facilitate cross-study data pooling in mental health research. Nevertheless, researchers are cautioned to verify the psychometric equivalence of matched items.</p>","PeriodicalId":9029,"journal":{"name":"BMC Psychiatry","volume":null,"pages":null},"PeriodicalIF":3.4000,"publicationDate":"2024-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11267737/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"BMC Psychiatry","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1186/s12888-024-05954-2","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"PSYCHIATRY","Score":null,"Total":0}
Citations: 0
Abstract
Background: Pooling data from different sources will advance mental health research by providing larger sample sizes and allowing cross-study comparisons; however, the heterogeneity in how variables are measured across studies poses a challenge to this process.
Methods: This study explored the potential of using natural language processing (NLP) to harmonise different mental health questionnaires by matching individual questions based on their semantic content. Using the Sentence-BERT model, we calculated the semantic similarity (cosine index) between 741 pairs of questions from five questionnaires. Drawing on data from a representative UK sample of adults (N = 2,058), we calculated a Spearman rank correlation for each of the same pairs of items, and then estimated the correlation between the cosine values and Spearman coefficients. We also used network analysis to explore the model's ability to uncover structures within the data and metadata.
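The core comparison described above can be sketched in a few lines: compute a cosine similarity between two item embeddings, and a Spearman rank correlation between participants' answers to the same two items. This is a minimal illustration, not the study's pipeline; the embedding vectors and response data below are made up (the study derived embeddings from a Sentence-BERT model).

```python
import numpy as np
from scipy.stats import spearmanr

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical sentence embeddings for two questionnaire items
# (in the study these would come from Sentence-BERT).
emb_item_a = np.array([0.2, 0.9, 0.1])
emb_item_b = np.array([0.25, 0.85, 0.05])
semantic_sim = cosine_similarity(emb_item_a, emb_item_b)

# Hypothetical Likert-scale responses from the same participants to
# both items; the Spearman rank correlation captures how similarly
# people actually answer them.
responses_a = [1, 3, 2, 4, 5, 2, 3]
responses_b = [2, 3, 2, 5, 4, 1, 3]
rho, p = spearmanr(responses_a, responses_b)

print(f"cosine = {semantic_sim:.3f}, spearman rho = {rho:.3f}")
```

The study's central question is then whether, across many item pairs, the first number predicts the second.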
Results: We found a moderate overall correlation (r = .48, p < .001) between the two indices. In a holdout sample, the cosine scores predicted the real-world correlations with a small degree of error (MAE = 0.05, MedAE = 0.04, RMSE = 0.064), suggesting the utility of NLP in identifying similar items for cross-study data pooling. Our NLP model could detect more complex patterns in our data; however, it required manual rules to decide which edges to include in the network.
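The holdout evaluation above scores how closely the cosine-based predictions track the observed correlations using three standard error metrics. A minimal sketch with illustrative predicted/observed values (not the study's data):

```python
import numpy as np

# Hypothetical predicted (cosine-based) vs observed (Spearman)
# correlations for a handful of item pairs -- illustrative only.
predicted = np.array([0.42, 0.10, 0.65, 0.30])
observed = np.array([0.45, 0.05, 0.60, 0.33])

errors = predicted - observed
mae = np.mean(np.abs(errors))        # mean absolute error
medae = np.median(np.abs(errors))    # median absolute error
rmse = np.sqrt(np.mean(errors ** 2)) # root mean squared error

print(f"MAE = {mae:.3f}, MedAE = {medae:.3f}, RMSE = {rmse:.3f}")
```

RMSE penalises large individual misses more than MAE, so reporting all three (as the abstract does) shows the errors are both small on average and free of extreme outliers.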
Conclusions: This research shows that it is possible to quantify the semantic similarity between pairs of questionnaire items from their metadata, and these similarity indices correlate with how participants would answer the same two items. This highlights the potential of NLP to facilitate cross-study data pooling in mental health research. Nevertheless, researchers are cautioned to verify the psychometric equivalence of matched items.
Journal description:
BMC Psychiatry is an open access, peer-reviewed journal that considers articles on all aspects of the prevention, diagnosis and management of psychiatric disorders, as well as related molecular genetics, pathophysiology, and epidemiology.