Illuminating The Path To Enhanced Resilience Of Machine Learning Models Against The Shadows Of Missing Labels.

IF 6.7 · Zone 2 (Medicine) · Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Journal of Biomedical and Health Informatics · Pub Date: 2025-04-08 · DOI: 10.1109/JBHI.2025.3558846

Simankov Nikolay, Tahzima Rachid, Massart Sebastien, Soyeurt Helene


Citations: 0

Abstract

The sensitivity of state-of-the-art supervised classification models is compromised by contamination-prone biomedical datasets, which are vulnerable to missing or erroneous labels (i.e., inliers). Starting from codon frequencies, electrocardiogram signals, biomarkers, morphological features, and patient questionnaires, we attempted to cover a wide range of typical biomedical databases exposed to the risk of missing data being labeled as negative (inlier contamination). In some niche fields, such as image recognition, missing labels have received a lot of attention, but in biomedical and clinical research, where outliers are almost systematically filtered, inliers have remained neglected. Our study introduced a pragmatic, automated methodology that repurposes one-class semi-supervised anomaly detection (OCSSAD) models to filter potential inliers from training datasets. Five OCSSAD and two ensemble methods were benchmarked on 6 databases with 10 different contamination levels and 10 random samples, achieving an average Matthews correlation coefficient (MCC) of 78 ± 17% in validation, whereas 22 supervised classifiers achieved an average MCC of 81 ± 9% when trained on the complete, uncontaminated training set. Filtering the training set with an isolation forest increased the average resilience to inliers of the 22 tested machine learning models, including neural networks and gradient-boosting methods, from 69 ± 11% to 95 ± 1%. Taken together, our study showcased the efficacy of this versatile approach in enhancing the resilience of machine learning models and highlighted the importance of accurately addressing the inlier challenge in medicine and the life sciences.
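The filtering idea described in the abstract can be sketched roughly as follows. This is an illustrative sketch only, not the authors' pipeline: the abstract does not give implementation details, so the code assumes that the isolation forest is fitted on the labeled positives and that any labeled negative it accepts as an inlier is dropped before training. The helper names `contaminate` and `filter_inliers`, the synthetic data, the 30% contamination rate, and the random forest classifier are all hypothetical choices made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest, RandomForestClassifier
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import train_test_split


def contaminate(y, rate, rng):
    """Relabel a fraction `rate` of positives as negatives (simulated missing labels)."""
    y_noisy = y.copy()
    pos_idx = np.flatnonzero(y == 1)
    flipped = rng.choice(pos_idx, size=int(rate * len(pos_idx)), replace=False)
    y_noisy[flipped] = 0
    return y_noisy


def filter_inliers(X, y_noisy, random_state=0):
    """Drop labeled negatives that a one-class model of the positives accepts.

    Assumption: the detector is fitted on labeled positives only, and a
    labeled negative predicted as an inlier (+1) is treated as a likely
    unlabeled positive and removed from the training set.
    """
    detector = IsolationForest(random_state=random_state)
    detector.fit(X[y_noisy == 1])
    looks_positive = detector.predict(X) == 1  # +1 = inlier, -1 = outlier
    keep = (y_noisy == 1) | ~looks_positive
    return X[keep], y_noisy[keep]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-in for a biomedical dataset (hypothetical settings).
    X, y = make_classification(
        n_samples=2000, n_features=10, n_informative=5,
        class_sep=2.0, random_state=0,
    )
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=0
    )
    y_noisy = contaminate(y_tr, rate=0.3, rng=rng)

    training_sets = {
        "contaminated": (X_tr, y_noisy),
        "filtered": filter_inliers(X_tr, y_noisy),
    }
    for name, (X_fit, y_fit) in training_sets.items():
        clf = RandomForestClassifier(random_state=0).fit(X_fit, y_fit)
        # MCC is evaluated against the clean test labels.
        mcc = matthews_corrcoef(y_te, clf.predict(X_te))
        print(f"{name}: MCC = {mcc:.3f}")
```

Whether filtering helps on a given dataset depends on how well the one-class model separates the two classes, so the printed scores should be read as a toy comparison rather than a reproduction of the reported figures.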


Source journal

IEEE Journal of Biomedical and Health Informatics (COMPUTER SCIENCE, INFORMATION SYSTEMS; COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS)

CiteScore: 13.60

Self-citation rate: 6.50%

Articles published: 1151

Journal description: IEEE Journal of Biomedical and Health Informatics publishes original papers presenting recent advances where information and communication technologies intersect with health, healthcare, life sciences, and biomedicine. Topics include acquisition, transmission, storage, retrieval, management, and analysis of biomedical and health information. The journal covers applications of information technologies in healthcare, patient monitoring, preventive care, early disease diagnosis, therapy discovery, and personalized treatment protocols. It explores electronic medical and health records, clinical information systems, decision support systems, medical and biological imaging informatics, wearable systems, body area/sensor networks, and more. Integration-related topics like interoperability, evidence-based medicine, and secure patient data are also addressed.