Simankov Nikolay, Tahzima Rachid, Massart Sebastien, Soyeurt Helene
IEEE Journal of Biomedical and Health Informatics. Published 2025-04-08. DOI: 10.1109/JBHI.2025.3558846 (https://doi.org/10.1109/JBHI.2025.3558846)
Illuminating The Path To Enhanced Resilience Of Machine Learning Models Against The Shadows Of Missing Labels.
The sensitivity of state-of-the-art supervised classification models is compromised by contamination-prone biomedical datasets, which are vulnerable to missing or erroneous labels (i.e., inliers). Starting from codon frequencies, electrocardiogram signals, biomarkers, morphological features, and patient questionnaires, we attempted to cover a wide range of typical biomedical databases exposed to the risk of missing data labeled as negative values (inlier contamination). In a few specialized fields, such as image recognition, missing labels have received a lot of attention, but in biomedical and clinical research, where outliers are almost systematically filtered, inliers have remained orphans. Our study introduced a pragmatic and innovative automated methodology that consists of upcycling one-class semi-supervised anomaly detection (OCSSAD) models to filter potential inliers out of training datasets. Five OCSSAD and two ensemble methods were benchmarked on six databases with 10 different contamination levels and 10 random samples, achieving an average Matthews correlation coefficient (MCC) of 78$\pm$17% in validation, whereas 22 supervised classifiers achieved an average MCC of 81$\pm$9% when trained on the complete, uncontaminated training set. By filtering the training set with an isolation forest, the average resilience to inliers of the 22 tested machine learning models, including neural networks and gradient-boosting methods, increased from 69$\pm$11% to 95$\pm$1%. Taken together, our study showcased the efficacy of our versatile approach in enhancing the resilience of machine learning models and highlighted the importance of accurately addressing the inlier challenge in the medical and life sciences.
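The filtering step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that the one-class model is fitted only on the samples labeled positive, and that any negative-labeled sample the model considers normal is treated as a suspected inlier and dropped before supervised training. The function name `filter_suspected_inliers`, the synthetic data, and the default isolation forest settings are all hypothetical choices made for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest


def filter_suspected_inliers(X, y, random_state=0):
    """Drop negative-labeled rows that resemble the labeled positives.

    Assumption (not taken from the paper): label 1 marks trusted positives,
    label 0 marks negatives that may hide unlabeled positives.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    is_positive = y == 1

    # Fit the one-class model on the labeled positives only.
    detector = IsolationForest(random_state=random_state)
    detector.fit(X[is_positive])

    # predict() returns +1 for points the model considers normal
    # (here: similar to the positives) and -1 for anomalies.
    looks_positive = detector.predict(X) == 1

    # Keep every labeled positive, and keep a negative only if it
    # does not look like a positive.
    keep = is_positive | ~looks_positive
    return X[keep], y[keep], keep


# Synthetic illustration: 200 labeled positives, 40 positives whose label
# is missing (recorded as 0), and 200 genuine negatives far away.
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(0.0, 1.0, size=(200, 2)),
    rng.normal(0.0, 1.0, size=(40, 2)),
    rng.normal(10.0, 1.0, size=(200, 2)),
])
y_demo = np.concatenate([np.ones(200, dtype=int), np.zeros(240, dtype=int)])

X_clean, y_clean, kept = filter_suspected_inliers(X_demo, y_demo)
print("hidden positives removed:", int((~kept[200:240]).sum()), "of 40")
print("genuine negatives kept:", int(kept[240:].sum()), "of 200")
```

The filtered arrays `X_clean` and `y_clean` would then be passed to any supervised classifier. How aggressively suspected inliers are removed depends on the detector's decision threshold, which this sketch leaves at the scikit-learn default.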
About the journal:
IEEE Journal of Biomedical and Health Informatics publishes original papers presenting recent advances where information and communication technologies intersect with health, healthcare, life sciences, and biomedicine. Topics include acquisition, transmission, storage, retrieval, management, and analysis of biomedical and health information. The journal covers applications of information technologies in healthcare, patient monitoring, preventive care, early disease diagnosis, therapy discovery, and personalized treatment protocols. It explores electronic medical and health records, clinical information systems, decision support systems, medical and biological imaging informatics, wearable systems, body area/sensor networks, and more. Integration-related topics like interoperability, evidence-based medicine, and secure patient data are also addressed.