Impact of possible errors in natural language processing-derived data on downstream epidemiologic analysis.

IF 2.5 Q2 HEALTH CARE SCIENCES & SERVICES

JAMIA Open Pub Date : 2023-12-27 eCollection Date: 2023-12-01 DOI:10.1093/jamiaopen/ooad111

Zhou Lan, Alexander Turchin

JAMIA Open, volume 6, issue 4, article ooad111. Publication type: Journal Article. Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10752385/pdf/


Abstract

Objective: To assess the impact of potential errors in natural language processing (NLP) on the results of epidemiologic studies.

Materials and methods: We utilized data from three outcomes research studies where the primary predictor variable was generated using NLP. For each of these studies, Monte Carlo simulations were applied to generate datasets simulating potential errors in NLP-derived variables. We subsequently fit the original regression models to these partially simulated datasets and compared the distribution of coefficient estimates to the original study results.

Results: Among the four models evaluated, the mean change in the point estimate of the relationship between the predictor variable and the outcome ranged from -21.9% to 4.12%. In three of the four models, the significance of this relationship was not eliminated in any of the 500 simulations; in the remaining model, it was eliminated in 12% of simulations. Mean changes in the estimates for confounder variables ranged from 0.27% to 2.27%, and the significance of their relationship with the outcome was eliminated in 0% to 9.25% of simulations. No variable underwent a shift in the direction of its interpretation.

Discussion: Impact of simulated NLP errors on the results of epidemiologic studies was modest, with only small changes in effect estimates and no changes in the interpretation of the findings (direction and significance of association with the outcome) for either the NLP-generated variables or other variables in the models.

Conclusion: NLP errors are unlikely to affect the results of studies that use NLP as the source of data.
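
The simulation procedure described under Materials and methods can be illustrated with a minimal sketch. This is not the authors' code: the synthetic data, the 5% label-flip rate, the single-predictor linear model (no confounders), and the helper names `fit_slope`, `flip_labels`, and `simulate` are all assumptions made for illustration. It flips a binary "NLP-derived" predictor at random, refits the model 500 times, and reports the mean percent change in the coefficient and the fraction of simulations in which significance is lost.

```python
import random
import statistics


def fit_slope(x, y):
    """Least-squares fit of y on a single predictor x; returns (slope, standard error)."""
    n = len(x)
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se = (rss / (n - 2) / sxx) ** 0.5
    return slope, se


def flip_labels(x, error_rate, rng):
    """Simulate NLP error: flip each binary label independently with probability error_rate."""
    return [1 - xi if rng.random() < error_rate else xi for xi in x]


def simulate(x, y, error_rate=0.05, n_sims=500, seed=0):
    """Refit the model on n_sims perturbed datasets and summarize the coefficient shift."""
    rng = random.Random(seed)
    base_slope, _ = fit_slope(x, y)
    pct_changes = []
    lost_significance = 0
    for _ in range(n_sims):
        slope, se = fit_slope(flip_labels(x, error_rate, rng), y)
        pct_changes.append(100.0 * (slope - base_slope) / base_slope)
        # Two-sided test at the 0.05 level, using a normal approximation (assumption).
        if abs(slope / se) < 1.96:
            lost_significance += 1
    return statistics.fmean(pct_changes), lost_significance / n_sims


# Synthetic stand-in for a study dataset (hypothetical values, not from the paper).
data_rng = random.Random(42)
x = [1 if data_rng.random() < 0.4 else 0 for _ in range(2000)]
y = [0.5 * xi + data_rng.gauss(0, 1) for xi in x]

mean_change, frac_lost = simulate(x, y)
print(f"mean change in estimate: {mean_change:.1f}%")
print(f"significance lost in {100 * frac_lost:.1f}% of simulations")
```

In this toy setup the random, non-differential label flips pull the estimate toward zero on average, which is the kind of shift the Results summarize as a mean percent change. A fuller version would use each study's original regression model, include the confounder variables, and draw error rates from the measured accuracy of the NLP tool.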


