Towards artificial intelligence-based disease prediction algorithms that comprehensively leverage and continuously learn from real-world clinical tabular data systems.

Impact factor (IF): 7.7

PLOS digital health. Pub Date: 2024-09-03. eCollection Date: 2024-09-01. DOI: 10.1371/journal.pdig.0000589

Terrence J Lee-St John, Oshin Kanwar, Emna Abidi, Wasim El Nekidy, Bartlomiej Piechowski-Jozwiak

Volume 3, issue 9, article e0000589. Publication type: Journal Article. Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11371204/pdf/

Citations: 0

Abstract

This manuscript presents a proof-of-concept for a generalizable strategy, the full algorithm, designed to estimate disease risk using real-world clinical tabular data systems, such as electronic health records (EHR) or claims databases. By integrating classic statistical methods and modern artificial intelligence techniques, this strategy automates the production of a disease prediction model that comprehensively reflects the dynamics contained within the underlying data system. Specifically, the full algorithm parses every facet of the data (e.g., encounters, diagnoses, procedures, medications, labs, chief complaints, flowsheets, vital signs, and demographics), selects which factors to retain as predictor variables by evaluating the data empirically against statistical criteria, structures the retained data into time series, trains a neural network-based prediction model, and then applies this model to current patients to generate risk estimates. A distinguishing feature of the proposed strategy is that it produces a self-adaptive prediction system: as newly collected data organically expand or modify the dataset, the prediction mechanism automatically evolves to reflect these changes. Moreover, the full algorithm operates without a priori data curation and aims to harness all informative risk and protective factors within the real-world data. This stands in contrast to traditional approaches, which often rely on highly curated datasets and domain expertise to build static prediction models based solely on well-known risk factors.

As a proof-of-concept, we codified the full algorithm and tasked it with estimating the 12-month risk of initial stroke or myocardial infarction using our hospital's real-world EHR. A 66-month pseudo-prospective validation was conducted using records from 558,105 patients spanning April 2015 to September 2023, totalling 3,424,060 patient-months. Area under the receiver operating characteristic curve (AUROC) values ranged from 0.830 to 0.909, with an improving trend over time. Odds ratios describing model precision for patients ranked 1-100 and 101-200 by estimated risk ranged from 15.3 to 48.1 and from 7.2 to 45.0, respectively, with both groups showing improving trends over time. These findings suggest that developing high-performing disease risk calculators in the proposed manner is feasible.
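
The two validation metrics named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' code. The function names `auroc` and `ranked_group_odds_ratio` and the toy scores and labels are invented for illustration. The abstract does not state the comparison group for the odds ratios, so the sketch assumes one plausible reading: the odds of an event among patients in a rank band (for example, ranks 1-100 by estimated risk) divided by the odds among all other patients.

```python
from typing import Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random case scores above a random non-case.

    Ties count as 0.5. Quadratic in the number of patients, which is
    fine for a sketch but not for millions of patient-months.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one non-case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def ranked_group_odds_ratio(
    scores: Sequence[float], labels: Sequence[int], start: int, stop: int
) -> float:
    """Odds ratio for patients ranked start..stop (1-based, inclusive).

    ASSUMPTION: the reference group is every patient outside the band.
    The paper may define the comparison differently.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    group = set(order[start - 1:stop])
    a = sum(labels[i] for i in group)              # events inside the band
    b = len(group) - a                             # non-events inside the band
    c = sum(y for i, y in enumerate(labels) if i not in group)  # events outside
    d = (len(labels) - len(group)) - c             # non-events outside
    if b == 0 or c == 0:
        raise ValueError("odds ratio undefined: a denominator cell is zero")
    return (a * d) / (b * c)


# Toy data, invented for illustration: 8 patients, 3 events.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 0, 0, 0]

print(round(auroc(scores, labels), 4))                 # 0.9333
print(ranked_group_odds_ratio(scores, labels, 1, 3))   # 8.0
```

On the toy data, 14 of the 15 case/non-case pairs are ordered correctly, giving an AUROC of 14/15. The top three ranked patients contain two events and one non-event, against one event and four non-events elsewhere, giving an odds ratio of 8.0.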


Source journal

PLOS digital health

Self-citation rate

0.00%

Number of publications