The incremental value of unstructured data via natural language processing in machine learning-based COVID-19 mortality prediction: a comparative study.

IF 3.8 | Region 3, Medicine | JCR Q2, Medical Informatics

BMC Medical Informatics and Decision Making Pub Date : 2025-09-26 DOI:10.1186/s12911-025-03178-2

Rildo Pinto da Silva, Antonio Pazin-Filho

Volume 25, issue 1, page 333. PMC full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12465758/pdf/

Citations: 0

Abstract

Background: While it is advocated that the use of unstructured data extracted from medical records is important for enhancing machine learning models, few studies have evaluated whether this occurs. A retrospective, head-to-head comparative study was conducted to evaluate machine learning models for in-hospital mortality prediction. The study assessed and quantified the potential performance improvement resulting from the inclusion of unstructured data.

Methods: Hospitalizations of patients with a confirmed COVID-19 diagnosis at a tertiary teaching hospital specialized in emergency care were selected (n = 844). For the models with structured data, 21 variables were selected from laboratory tests and patient monitoring. For the hybrid models, an additional 21 clinical assertions (e.g., "has_symptom affirmed dyspnea") were included. Six models with the best discriminative performance out of 11 trained and validated were selected for the testing phase. The most representative variables were evaluated using an explainable artificial intelligence model.
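As a rough illustration of what a hybrid feature vector could look like, the sketch below appends one binary flag per clinical assertion to a set of structured numeric variables. The variable names, every assertion string other than the dyspnea example, and the encoding itself are assumptions made for illustration; the abstract does not describe the authors' actual pipeline.

```python
# Hypothetical sketch: combine structured variables with 0/1 flags for
# NLP-extracted clinical assertions. All names and values are illustrative.

# Assumed structured variables (the study used 21; three stand in here).
STRUCTURED_VARS = ["creatinine", "heart_rate", "oxygen_saturation"]

# Assumed assertion vocabulary (the study used 21). Only the first string
# appears in the abstract; the other two are invented placeholders.
ASSERTIONS = [
    "has_symptom affirmed dyspnea",
    "has_symptom affirmed fever",
    "has_symptom negated cough",
]


def build_hybrid_vector(structured, assertions_found):
    """Return the structured values followed by one 0/1 flag per assertion."""
    numeric = [float(structured[name]) for name in STRUCTURED_VARS]
    flags = [1.0 if a in assertions_found else 0.0 for a in ASSERTIONS]
    return numeric + flags


# One made-up hospitalization record.
patient = {"creatinine": 1.4, "heart_rate": 98, "oxygen_saturation": 91}
found = {"has_symptom affirmed dyspnea"}
vector = build_hybrid_vector(patient, found)
print(vector)
```

Under this encoding, a structured-only model would see just the first three values, while a hybrid model would see all six, which is the kind of comparison the study reports.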

Results: The random forest model demonstrated the highest performance, achieving an area under the receiver operating characteristic curve (AUC ROC) of 0.9260, an increase from 0.9170 when using only structured data. The inclusion of unstructured data also improved sensitivity from 0.8108 to 0.8378 while specificity was maintained at 0.8667. However, these performance improvements were not found to be statistically significantly different from models with only structured data.
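For readers less familiar with the metrics reported above, the following minimal sketch computes sensitivity, specificity, and AUC ROC from labels and predicted scores. The labels, scores, and the 0.5 decision threshold are made up for illustration and are not the study's data; the pairwise-ranking form of AUC is a standard equivalent definition, not necessarily how the authors computed it.

```python
# Sketch of the reported metrics on made-up labels and scores
# (1 = died in hospital, 0 = survived). Not the study's data.


def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) from binary labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def auc_roc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs in which the
    positive case receives the higher score; ties count as half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


y_true = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

sens, spec = sensitivity_specificity(y_true, y_pred)
auc = auc_roc(y_true, scores)
print(sens, spec, auc)
```

Sensitivity and specificity depend on the chosen threshold, whereas AUC summarizes ranking quality across all thresholds, which is why a model can gain sensitivity at unchanged specificity while its AUC moves only slightly.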

Conclusion: The study concluded that the inclusion of unstructured data did not increase the predictive power of machine learning models for COVID-19 mortality. It was also determined that human involvement is crucial for implementation, specifically for validating natural language processing (NLP) outputs and tailoring the selection of unstructured features, given the inherent challenges in processing such data.

Clinical trial number: Not applicable.



Source journal

BMC Medical Informatics and Decision Making (Medicine: Medical Informatics)

CiteScore: 7.20
Self-citation rate: 5.70%
Publications: 297
Review time: 1 month

Journal description: BMC Medical Informatics and Decision Making is an open access journal publishing original peer-reviewed research articles in relation to the design, development, implementation, use, and evaluation of health information technologies and decision-making for human health.