Addressing label noise for electronic health records: insights from computer vision for tabular data.

IF 3.3 | Zone 3 (Medicine) | JCR Q2 (Medical Informatics)

BMC Medical Informatics and Decision Making Pub Date : 2024-06-27 DOI:10.1186/s12911-024-02581-5

Jenny Yang, Hagen Triendl, Andrew A S Soltan, Mangal Prakash, David A Clifton

Volume 24, issue 1, page 183. Publication type: Journal Article. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11212446/pdf/

Citations: 0

Abstract

The analysis of extensive electronic health records (EHR) datasets often calls for automated solutions, with machine learning (ML) techniques, including deep learning (DL), taking a lead role. One common task involves categorizing EHR data into predefined groups. However, the vulnerability of EHRs to noise and errors stemming from data collection processes, as well as potential human labeling errors, poses a significant risk. This risk is particularly prominent during the training of DL models, where the possibility of overfitting to noisy labels can have serious repercussions in healthcare. Despite the well-documented existence of label noise in EHR data, few studies have tackled this challenge within the EHR domain. Our work addresses this gap by adapting computer vision (CV) algorithms to mitigate the impact of label noise in DL models trained on EHR data. Notably, it remains uncertain whether CV methods, when applied to the EHR domain, will prove effective, given the substantial divergence between the two domains. We present empirical evidence demonstrating that these methods, whether used individually or in combination, can substantially enhance model performance when applied to EHR data, especially in the presence of noisy/incorrect labels. We validate our methods and underscore their practical utility in real-world EHR data, specifically in the context of COVID-19 diagnosis. Our study highlights the effectiveness of CV methods in the EHR domain, making a valuable contribution to the advancement of healthcare analytics and research.
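The abstract does not name the specific CV methods that were adapted, so the following is only an illustrative sketch and not the authors' method. It assumes binary labels and shows two things: how symmetric label noise can be simulated by flipping labels at a chosen rate, and how label smoothing (one widely used technique from the CV literature, chosen here as an assumption) softens the penalty when a confident prediction is scored against a mislabeled target. The function names `flip_labels`, `smooth_label`, and `bce` are hypothetical helpers written for this example.

```python
import math
import random


def flip_labels(labels, noise_rate, seed=0):
    """Flip each binary label with probability noise_rate (symmetric noise)."""
    rng = random.Random(seed)
    return [1 - y if rng.random() < noise_rate else y for y in labels]


def smooth_label(y, eps=0.1):
    """Pull a hard 0/1 target toward 0.5 by a smoothing factor eps."""
    return y * (1.0 - eps) + 0.5 * eps


def bce(p, target):
    """Binary cross-entropy for predicted probability p and a target in [0, 1]."""
    p = min(max(p, 1e-7), 1.0 - 1e-7)  # clamp to avoid log(0)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


# Simulate 20% symmetric label noise on a balanced toy label vector.
clean = [0, 1] * 500
noisy = flip_labels(clean, noise_rate=0.2)
observed_rate = sum(c != n for c, n in zip(clean, noisy)) / len(clean)

# A confident prediction (p = 0.99) scored against a mislabeled target (y = 0):
# the smoothed target yields a smaller loss than the hard target.
hard_loss = bce(0.99, 0)
soft_loss = bce(0.99, smooth_label(0, eps=0.1))

print(f"observed flip rate: {observed_rate:.3f}")
print(f"hard-target loss: {hard_loss:.3f}, smoothed-target loss: {soft_loss:.3f}")
```

In this sketch the smoothed target reduces how strongly a single wrong label can pull on the model, which is the general intuition behind noise-robust training; whether this particular technique helps on EHR data is an empirical question that the sketch does not answer.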

Source journal

BMC Medical Informatics and Decision Making (Medicine: Medical Informatics)

CiteScore: 7.20

Self-citation rate: 5.70%

Articles published: 297

Review time: 1 month

About the journal: BMC Medical Informatics and Decision Making is an open access journal publishing original peer-reviewed research articles in relation to the design, development, implementation, use, and evaluation of health information technologies and decision-making for human health.