Interaction effect between data discretization and data resampling for class-imbalanced medical datasets.

IF 1.4 | Q4 | Medicine | ENGINEERING, BIOMEDICAL
Technology and Health Care | Pub Date: 2025-03-01 | Epub Date: 2024-11-25 | DOI: 10.1177/09287329241295874
Min-Wei Huang, Chih-Fong Tsai, Wei-Chao Lin, Jia-Yang Lin
{"title":"类不平衡医疗数据集数据离散化与数据重采样的交互效应。","authors":"Min-Wei Huang, Chih-Fong Tsai, Wei-Chao Lin, Jia-Yang Lin","doi":"10.1177/09287329241295874","DOIUrl":null,"url":null,"abstract":"<p><p>BackgroundData discretization is an important preprocessing step in data mining for the transfer of continuous feature values to discrete ones, which allows some specific data mining algorithms to construct more effective models and facilitates the data mining process. Because many medical domain datasets are class imbalanced, data resampling methods, including oversampling, undersampling, and hybrid sampling methods, have been widely applied to rebalance the training set, facilitating effective differentiation between majority and minority classes.ObjectiveHerein, we examine the effect of incorporating both data discretization and data resampling as steps in the analytical process on the classifier performance for class-imbalanced medical datasets. The order in which these two steps are carried out is compared in the experiments.MethodsTwo experimental studies were conducted, one based on 11 two-class imbalanced medical datasets and the other using 3 multiclass imbalanced medical datasets. In addition, the two discretization algorithms employed are ChiMerge and minimum description length principle (MDLP). On the other hand, the data resampling algorithms chosen for performance comparison are Tomek links undersampling, synthetic minority oversampling technique (SMOTE) oversampling, and SMOTE-Tomek hybrid sampling algorithms. Moreover, the support vector machine (SVM), C4.5 decision tree, and random forest (RF) techniques were used to examine the classification performances of the different approaches.ResultsThe results show that on average, the combination approaches can allow the classifiers to provide higher area under the ROC curve (AUC) rates than the best baseline approach at approximately 0.8%-3.5% and 0.9%-2.5% for twoclass and multiclass imbalanced medical datasets, respectively. Particularly, the optimal results for two-class imbalanced datasets are obtained by performing the MDLP method first for data discretization and SMOTE second for oversampling, providing the highest AUC rate and requiring the least computational cost. For multiclass imbalanced datasets, performing SMOTE or SMOTE-Tomek first for data resampling and ChiMerge second for data discretization offers the best performances.ConclusionsClassifiers with oversampling can provide better performances than the baseline method without oversampling. In contrast, performing data discretization does not necessarily make the classifiers outperform the baselines. 
On average, the combination approaches have potential to allow the classifiers to provide higher AUC rates than the best baseline approach.</p>","PeriodicalId":48978,"journal":{"name":"Technology and Health Care","volume":"33 2","pages":"1000-1013"},"PeriodicalIF":1.4000,"publicationDate":"2025-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Interaction effect between data discretization and data resampling for class-imbalanced medical datasets.\",\"authors\":\"Min-Wei Huang, Chih-Fong Tsai, Wei-Chao Lin, Jia-Yang Lin\",\"doi\":\"10.1177/09287329241295874\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p>BackgroundData discretization is an important preprocessing step in data mining for the transfer of continuous feature values to discrete ones, which allows some specific data mining algorithms to construct more effective models and facilitates the data mining process. Because many medical domain datasets are class imbalanced, data resampling methods, including oversampling, undersampling, and hybrid sampling methods, have been widely applied to rebalance the training set, facilitating effective differentiation between majority and minority classes.ObjectiveHerein, we examine the effect of incorporating both data discretization and data resampling as steps in the analytical process on the classifier performance for class-imbalanced medical datasets. The order in which these two steps are carried out is compared in the experiments.MethodsTwo experimental studies were conducted, one based on 11 two-class imbalanced medical datasets and the other using 3 multiclass imbalanced medical datasets. In addition, the two discretization algorithms employed are ChiMerge and minimum description length principle (MDLP). On the other hand, the data resampling algorithms chosen for performance comparison are Tomek links undersampling, synthetic minority oversampling technique (SMOTE) oversampling, and SMOTE-Tomek hybrid sampling algorithms. Moreover, the support vector machine (SVM), C4.5 decision tree, and random forest (RF) techniques were used to examine the classification performances of the different approaches.ResultsThe results show that on average, the combination approaches can allow the classifiers to provide higher area under the ROC curve (AUC) rates than the best baseline approach at approximately 0.8%-3.5% and 0.9%-2.5% for twoclass and multiclass imbalanced medical datasets, respectively. Particularly, the optimal results for two-class imbalanced datasets are obtained by performing the MDLP method first for data discretization and SMOTE second for oversampling, providing the highest AUC rate and requiring the least computational cost. For multiclass imbalanced datasets, performing SMOTE or SMOTE-Tomek first for data resampling and ChiMerge second for data discretization offers the best performances.ConclusionsClassifiers with oversampling can provide better performances than the baseline method without oversampling. In contrast, performing data discretization does not necessarily make the classifiers outperform the baselines. 
On average, the combination approaches have potential to allow the classifiers to provide higher AUC rates than the best baseline approach.</p>\",\"PeriodicalId\":48978,\"journal\":{\"name\":\"Technology and Health Care\",\"volume\":\"33 2\",\"pages\":\"1000-1013\"},\"PeriodicalIF\":1.4000,\"publicationDate\":\"2025-03-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Technology and Health Care\",\"FirstCategoryId\":\"5\",\"ListUrlMain\":\"https://doi.org/10.1177/09287329241295874\",\"RegionNum\":4,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"2024/11/25 0:00:00\",\"PubModel\":\"Epub\",\"JCR\":\"Q4\",\"JCRName\":\"ENGINEERING, BIOMEDICAL\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Technology and Health Care","FirstCategoryId":"5","ListUrlMain":"https://doi.org/10.1177/09287329241295874","RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2024/11/25 0:00:00","PubModel":"Epub","JCR":"Q4","JCRName":"ENGINEERING, BIOMEDICAL","Score":null,"Total":0}
Citations: 0

Abstract

Background: Data discretization is an important preprocessing step in data mining that transforms continuous feature values into discrete ones, allowing certain data mining algorithms to construct more effective models and facilitating the data mining process. Because many medical-domain datasets are class imbalanced, data resampling methods, including oversampling, undersampling, and hybrid sampling, have been widely applied to rebalance the training set and support effective differentiation between the majority and minority classes.

Objective: We examine how incorporating both data discretization and data resampling as steps in the analytical process affects classifier performance on class-imbalanced medical datasets. The experiments also compare the order in which these two steps are carried out.

Methods: Two experimental studies were conducted, one based on 11 two-class imbalanced medical datasets and the other on 3 multiclass imbalanced medical datasets. The two discretization algorithms employed are ChiMerge and the minimum description length principle (MDLP). The data resampling algorithms chosen for performance comparison are Tomek links undersampling, synthetic minority oversampling technique (SMOTE) oversampling, and the SMOTE-Tomek hybrid sampling algorithm. Support vector machine (SVM), C4.5 decision tree, and random forest (RF) classifiers were used to examine the classification performance of the different approaches.

Results: On average, the combination approaches allow the classifiers to provide higher area under the ROC curve (AUC) rates than the best baseline approach, by approximately 0.8%-3.5% and 0.9%-2.5% for two-class and multiclass imbalanced medical datasets, respectively. In particular, the optimal results for two-class imbalanced datasets are obtained by performing MDLP first for data discretization and SMOTE second for oversampling, which provides the highest AUC rate while requiring the least computational cost. For multiclass imbalanced datasets, performing SMOTE or SMOTE-Tomek first for data resampling and ChiMerge second for data discretization offers the best performance.

Conclusions: Classifiers with oversampling can provide better performance than the baseline method without oversampling. In contrast, performing data discretization does not necessarily make the classifiers outperform the baselines. On average, the combination approaches have the potential to allow the classifiers to provide higher AUC rates than the best baseline approach.
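To make the compared workflows concrete, below is a minimal sketch (not the authors' code) of the two preprocessing orderings evaluated in the study, assuming scikit-learn and imbalanced-learn are available. MDLP and ChiMerge discretizers are not part of scikit-learn, so KBinsDiscretizer is used here only as a hypothetical stand-in for the discretization step, and a synthetic dataset stands in for the medical data; the printed numbers are purely illustrative, not the paper's results.

```python
# Minimal sketch of the two preprocessing orderings discussed in the abstract.
# Assumptions: scikit-learn and imbalanced-learn are installed; KBinsDiscretizer
# stands in for MDLP/ChiMerge (neither ships with scikit-learn); a synthetic
# dataset stands in for the medical data. This is not the authors' code.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import KBinsDiscretizer
from imblearn.over_sampling import SMOTE

# Synthetic two-class imbalanced dataset (about 10% minority class).
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.9, 0.1],
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    random_state=42)

def auc_of_random_forest(X_tr, y_tr, X_te, y_te):
    """Train a random forest on the preprocessed training set and report test AUC."""
    clf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_tr, y_tr)
    return roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

# Ordering A: discretize first, then oversample the training set with SMOTE
# (the ordering the abstract reports as best for two-class data, with MDLP).
disc_a = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile")
X_tr_a = disc_a.fit_transform(X_train)
X_te_a = disc_a.transform(X_test)
X_tr_a, y_tr_a = SMOTE(random_state=42).fit_resample(X_tr_a, y_train)
print("discretize -> SMOTE AUC:", auc_of_random_forest(X_tr_a, y_tr_a, X_te_a, y_test))

# Ordering B: oversample first, then discretize (the ordering the abstract
# reports as best for the multiclass datasets, there with ChiMerge).
X_tr_b, y_tr_b = SMOTE(random_state=42).fit_resample(X_train, y_train)
disc_b = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile")
X_tr_b = disc_b.fit_transform(X_tr_b)
X_te_b = disc_b.transform(X_test)
print("SMOTE -> discretize AUC:", auc_of_random_forest(X_tr_b, y_tr_b, X_te_b, y_test))
```

Note that only the training set is resampled; the test set is left untouched so the AUC reflects the original class distribution. TomekLinks (imblearn.under_sampling) or SMOTETomek (imblearn.combine) can be swapped in for SMOTE at the same positions to mirror the other resampling settings in the study.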

Source journal
Technology and Health Care (HEALTH CARE SCIENCES & SERVICES; ENGINEERING, BIOMEDICAL)
CiteScore: 2.10
Self-citation rate: 6.20%
Articles published: 282
Review time: >12 weeks
About the journal: Technology and Health Care is intended to serve as a forum for the presentation of original articles and technical notes, observing rigorous scientific standards. Furthermore, upon invitation, reviews, tutorials, discussion papers and minisymposia are featured. The main focus of THC is related to the overlapping areas of engineering and medicine. The following types of contributions are considered:
1. Original articles: New concepts, procedures and devices associated with the use of technology in medical research and clinical practice are presented to a readership with a widespread background in engineering and/or medicine. In particular, the clinical benefit deriving from the application of engineering methods and devices in clinical medicine should be demonstrated. Typically, full-length original contributions have a length of 4000 words, thereby taking duly into account figures and tables.
2. Technical Notes and Short Communications: Technical Notes relate to novel technical developments with relevance for clinical medicine. In Short Communications, clinical applications are shortly described. Both Technical Notes and Short Communications typically have a length of 1500 words.
3. Reviews and Tutorials (upon invitation only): Tutorial and educational articles for persons with a primarily medical background on principles of engineering with particular significance for biomedical applications and vice versa are presented. The Editorial Board is responsible for the selection of topics.
4. Minisymposia (upon invitation only): Under the leadership of a Special Editor, controversial or important issues relating to health care are highlighted and discussed by various authors.
5. Letters to the Editors: Discussions or short statements (not indexed).