Impact of possible errors in natural language processing-derived data on downstream epidemiologic analysis.

Impact factor: 2.5 (JCR Q2, Health Care Sciences & Services)
JAMIA Open. Publication date: 2023-12-27. eCollection date: 2023-12-01. DOI: 10.1093/jamiaopen/ooad111
Zhou Lan, Alexander Turchin
Citation count: 0

Abstract

Objective: To assess the impact of potential errors in natural language processing (NLP) on the results of epidemiologic studies.

Materials and methods: We utilized data from three outcomes research studies where the primary predictor variable was generated using NLP. For each of these studies, Monte Carlo simulations were applied to generate datasets simulating potential errors in NLP-derived variables. We subsequently fit the original regression models to these partially simulated datasets and compared the distribution of coefficient estimates to the original study results.
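The simulate-then-refit procedure described above can be sketched as follows. This is a minimal illustration, not the authors' code. It assumes a binary NLP-derived predictor, a one-predictor least-squares model, and made-up error rates (`false_neg_rate`, `false_pos_rate`); the abstract does not specify the error model or the regression types used. All function names and the toy data are hypothetical.

```python
import random

def ols_slope(x, y):
    """Closed-form slope of a one-predictor least-squares fit."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx

def perturb(flags, false_neg_rate, false_pos_rate, rng):
    """Flip each binary flag with an assumed error probability."""
    noisy = []
    for flag in flags:
        if flag == 1:
            noisy.append(0 if rng.random() < false_neg_rate else 1)
        else:
            noisy.append(1 if rng.random() < false_pos_rate else 0)
    return noisy

def simulate(flags, outcome, n_sims=500,
             false_neg_rate=0.10, false_pos_rate=0.05, seed=0):
    """Refit the model on n_sims perturbed copies of the predictor."""
    rng = random.Random(seed)
    return [
        ols_slope(perturb(flags, false_neg_rate, false_pos_rate, rng), outcome)
        for _ in range(n_sims)
    ]

# Toy data (assumption): the true effect of the flag on the outcome is 2.0.
data_rng = random.Random(42)
flags = [1 if data_rng.random() < 0.4 else 0 for _ in range(2000)]
outcome = [2.0 * f + data_rng.gauss(0.0, 1.0) for f in flags]

original = ols_slope(flags, outcome)   # estimate on the unperturbed data
slopes = simulate(flags, outcome)      # distribution under simulated errors
```

Comparing `slopes` with `original` then shows how far the estimate drifts when the predictor is misclassified. In this toy setup, random misclassification pulls the slope toward zero.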

Results: Among the four models evaluated, the mean change in the point estimate of the relationship between the predictor variable and the outcome ranged from -21.9% to 4.12%. In three of the four models, significance of this relationship was not eliminated in any of the 500 simulations, and in one model it was eliminated in 12% of simulations. Mean changes in the estimates for confounder variables ranged from 0.27% to 2.27%, and significance of the relationship was eliminated in between 0% and 9.25% of simulations. No variable underwent a shift in the direction of its interpretation.
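The summary statistics reported above (mean percent change in the estimate, share of simulations in which significance was lost, and whether the direction of the association changed) could be computed along these lines. This is a hedged sketch: the function name, the 0.05 threshold, and the toy numbers are assumptions for illustration and are not taken from the study.

```python
from statistics import mean

def summarize(original_estimate, sims, alpha=0.05):
    """Summarize simulated fits given as (estimate, p_value) pairs."""
    pct_change = [
        100.0 * (est - original_estimate) / original_estimate
        for est, _ in sims
    ]
    lost = sum(1 for _, p in sims if p >= alpha)
    flipped = sum(1 for est, _ in sims if est * original_estimate < 0)
    return {
        "mean_pct_change": mean(pct_change),
        "frac_significance_lost": lost / len(sims),
        "frac_direction_flipped": flipped / len(sims),
    }

# Toy values only (assumption), not results from the paper.
toy_sims = [(0.40, 0.01), (0.45, 0.03), (0.35, 0.08), (0.40, 0.02)]
summary = summarize(0.50, toy_sims)
```

A negative `mean_pct_change` indicates that, on average, the simulated errors moved the estimate toward zero relative to the original fit.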

Discussion: Impact of simulated NLP errors on the results of epidemiologic studies was modest, with only small changes in effect estimates and no changes in the interpretation of the findings (direction and significance of association with the outcome) for either the NLP-generated variables or other variables in the models.

Conclusion: NLP errors are unlikely to affect the results of studies that use NLP as the source of data.

Source journal: JAMIA Open (Medicine, Health Informatics)
CiteScore: 4.10
Self-citation rate: 4.80%
Articles published: 102
Review time: 16 weeks