Evaluation of Structured, Semi-Structured, and Free-Text Electronic Health Record Data to Classify Hepatitis C Virus (HCV) Infection

Impact Factor: 0.8 | JCR Q4 | Gastroenterology & Hepatology

Gastrointestinal disorders (Basel, Switzerland) | Pub Date: 2023-03-31 | DOI: 10.3390/gidisord5020012

A. Fong, Justin M. Hughes, Sravya Gundapenini, Benjamin Hack, Mahdi Barkhordar, S. Huang, Adam J. Visconti, Stephen J Fernandez, D. Fishbein



Abstract

Evaluation of the United States Centers for Disease Control and Prevention (CDC)-defined HCV-related risk factors is not consistently performed as part of routine care, rendering risk-based testing susceptible to clinician bias and missed diagnoses. This work uses natural language processing (NLP) and machine learning to identify patients who are at high risk for HCV infection. Models were developed and validated to predict patients with newly identified HCV infection (detectable RNA or reported HCV diagnosis). We evaluated models with three types of variables: structured (structured-based model), semi-structured and free-text notes (text-based model), and all variables (full-set model). We applied each model to three stratifications of data: patients with no history of HCV prior to 2020, patients with a history of HCV prior to 2020, and all patients. We used XGBoost and ten-fold cross-validation, scored by the C-statistic, to evaluate the generalizability of the models. There were 3564 unique patients, 487 of whom had HCV infection. The average C-statistics of the structured-based, text-based, and full-set models for all patients were 0.777 (95% CI: 0.744–0.810), 0.677 (95% CI: 0.631–0.723), and 0.774 (95% CI: 0.735–0.813), respectively. For patients with no history of HCV prior to 2020, the full-set model performed slightly better than the structured-based model and similarly to the text-based model, with average C-statistics of 0.780, 0.774, and 0.759, respectively. NLP identified six additional risk factors that were inconsistently coded in structured elements: incarceration, needlestick, substance use or abuse, sexually transmitted infections, piercings, and tattoos. The availability of model options (structured-based or text-based models) with similar performance can provide deployment flexibility in situations where data are limited.
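The evaluation described in the abstract (ten-fold cross-validation scored by the C-statistic) can be illustrated with a minimal, standard-library-only sketch. Everything named here is hypothetical and not taken from the paper: `fit_flag_weights` is a toy stand-in for the XGBoost classifier the authors used, the stratified fold assignment is one reasonable choice among several, and the 95% interval is a normal approximation across fold scores, which the abstract does not confirm as the authors' method.

```python
import math
import random
from statistics import mean, stdev


def c_statistic(labels, scores):
    """Chance that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    concordant = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                concordant += 1.0
            elif p == n:
                concordant += 0.5
    return concordant / (len(pos) * len(neg))


def stratified_folds(labels, k=10, seed=0):
    """Split indices into k folds, dealing each class out round-robin."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % k].append(i)
    return folds


def fit_flag_weights(rows, labels):
    """Toy scorer (assumption, NOT XGBoost): weight each binary flag by how far
    the outcome rate among flagged patients departs from the overall rate."""
    base = sum(labels) / len(labels)
    weights = []
    for j in range(len(rows[0])):
        hits = [y for r, y in zip(rows, labels) if r[j]]
        weights.append(sum(hits) / len(hits) - base if hits else 0.0)
    return lambda row: sum(w for w, flag in zip(weights, row) if flag)


def cross_validated_c(rows, labels, fit, k=10, seed=0):
    """Train on k-1 folds, score the held-out fold, return one C-statistic per fold."""
    results = []
    for held_out in stratified_folds(labels, k, seed):
        held = set(held_out)
        train = [i for i in range(len(labels)) if i not in held]
        score = fit([rows[i] for i in train], [labels[i] for i in train])
        results.append(
            c_statistic(
                [labels[i] for i in held_out],
                [score(rows[i]) for i in held_out],
            )
        )
    return results


def mean_with_ci(values, z=1.96):
    """Mean and normal-approximation 95% interval across folds (assumed method)."""
    m = mean(values)
    half = z * stdev(values) / math.sqrt(len(values))
    return m, m - half, m + half
```

To compare the three variable sets in the same way, one would call `cross_validated_c` once per feature matrix (structured, text-derived, and combined) with the same labels and seed, then pass each list of fold scores to `mean_with_ci`.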


