Application and performance of tree model-based classifier and anomaly-detection approaches for medical imbalanced data
Authors: Yu Hidaka, Toru Imai, Katsuhiro Omae, Tomo Kagawa, Shigenao Ishikawa, Tomoki Inaba
Journal: Informatics in Medicine Unlocked, volume 58, Article 101677 (2025)
DOI: 10.1016/j.imu.2025.101677
URL: https://www.sciencedirect.com/science/article/pii/S2352914825000668
In medical data, analyzing imbalanced datasets, where positive cases are far fewer than negative cases, is a key challenge. Several approaches have been proposed, including anomaly detection and classifier-based methods; however, the optimal conditions for each remain unclear. In this study, which mainly focuses on tree model-based approaches, we systematically compared the effectiveness of classifier-based methods (synthetic minority oversampling technique, Under-bagging, Weighted Random Forest, and Balanced Random Forest) and the anomaly detection method, Isolation Forest, using 15 real-world medical datasets. All datasets involved binary classification problems, with sample sizes ranging from approximately 100 to 10,000 and positivity rates from 2% to 35%. The number of features per dataset ranged from 6 to 278, with categorical feature rates varying from 0% to 100%. Performance was primarily evaluated using the area under the receiver operating characteristic curve and the area under the precision–recall curve, which are particularly suitable for imbalanced data. The results showed that classifier-based methods performed poorly when positive cases did not form clusters in t-distributed stochastic neighbor embedding visualizations and when datasets contained a high proportion of categorical features. Conversely, anomaly detection approaches outperformed classifier-based methods under these conditions, especially with small sample sizes and high positivity rates. These findings provide practical guidance for selecting effective methods to address class imbalance in medical datasets.
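The two evaluation metrics named in the abstract can be illustrated with a small from-scratch sketch. This is not the authors' code: the labels and scores are made up for illustration, the function names are hypothetical, and the area under the precision-recall curve is approximated here by average precision, a common step-wise estimator (the abstract does not say which estimator the study used).

```python
# Minimal sketch under stated assumptions: toy data, hypothetical function names.

def auroc(labels, scores):
    """Probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Step-wise estimate of the area under the precision-recall curve.

    Averages the precision observed at each rank where a positive case
    is retrieved. Assumes scores are untied; with ties the result
    depends on how tied cases are ordered.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive case")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = 0
    total = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / n_pos


# Toy example: 10 cases, 2 positives (20% positivity rate).
labels = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
scores = [0.10, 0.40, 0.80, 0.20, 0.90, 0.30, 0.60, 0.05, 0.15, 0.25]

print(f"AUROC = {auroc(labels, scores):.3f}")
print(f"Average precision = {average_precision(labels, scores):.3f}")
```

A random ranking has an expected AUROC of 0.5 whatever the class balance, whereas its expected precision is roughly the positivity rate (0.2 in this toy example). A precision-recall value is therefore best read against the prevalence of the dataset it was computed on.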
About the journal:
Informatics in Medicine Unlocked (IMU) is an international gold open access journal covering a broad spectrum of topics within medical informatics, including (but not limited to) papers focusing on imaging, pathology, teledermatology, public health, ophthalmological, nursing and translational medicine informatics. The full papers that are published in the journal are accessible to all who visit the website.