A Novel Pre-processing Method for Classification Problems in Medical Intelligent Tasks
Haochen Jiang, Ziqi Wei, Jun Chen
2021 IEEE International Conference on Digital Health (ICDH), pp. 178-183, published 2021-09-01
DOI: https://doi.org/10.1109/icdh52753.2021.00032
Citations: 1
Abstract
Classification is one of the most common tasks in medical intelligence. It arises in triage, diagnosis, and pathological analysis, and many classification algorithms from machine learning can be applied to it. However, owing to the nature of the medical domain, its data sets tend to be imbalanced: the samples are unevenly distributed across classes. Classifying imbalanced data is a classic and notoriously hard problem in the data mining and artificial intelligence research communities. Worse, most proposed methods are designed for the binary case, whereas medical intelligence applications commonly involve multi-class classification. To address this, this paper proposes a pre-processing structure called Cost-Sensitive Variable Neighbour Search (CSVNS). It combines sampling and cost-sensitive learning, the two most commonly used strategies for multi-class imbalanced classification. For the sampling process, a double-stack Variable Neighbour Search (VNS) structure is introduced, and 15 different neighborhood structures are designed to help optimize it. In addition, the classes are assigned different weights to improve the classifier's classification ability. In the experiments, the proposed method is evaluated on 4 medical data sets, with G-mean and mAUC selected as the performance measures. The results show that the proposed method outperforms the classic methods in most situations. Finally, 3 additional data sets are tested to demonstrate the algorithm's scalability.
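The abstract names G-mean as a metric and mentions per-class weights without defining either. The sketch below is illustrative only. It assumes the common multi-class definition of G-mean (the geometric mean of per-class recalls) and uses a simple inverse-frequency weighting as a stand-in. The function names and the weighting rule are assumptions made for this example, not the CSVNS procedure described in the paper.

```python
import math
from collections import Counter


def per_class_recall(y_true, y_pred):
    """Return {class: recall} for every class that appears in y_true."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: hits[c] / totals[c] for c in totals}


def g_mean(y_true, y_pred):
    """Geometric mean of per-class recalls (assumed multi-class G-mean)."""
    recalls = list(per_class_recall(y_true, y_pred).values())
    return math.prod(recalls) ** (1.0 / len(recalls))


def inverse_frequency_weights(y_true):
    """Hypothetical weighting: n_samples / (n_classes * class_count).

    Rare classes receive larger weights. This is NOT the paper's scheme.
    """
    counts = Counter(y_true)
    n_samples, n_classes = len(y_true), len(counts)
    return {c: n_samples / (n_classes * counts[c]) for c in counts}


# Toy imbalanced labels: class 0 dominates, class 2 has a single sample.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 2]

score = g_mean(y_true, y_pred)
weights = inverse_frequency_weights(y_true)

# A classifier that always predicts the majority class has zero recall
# on the minority classes, so its G-mean collapses to 0.
majority_only = g_mean(y_true, [0] * len(y_true))

print(f"G-mean: {score:.3f}")
print(f"Class weights: {weights}")
print(f"G-mean of majority-only predictions: {majority_only}")
```

Because G-mean multiplies the per-class recalls, a single ignored class drives the score to zero, which is why it is a common choice for imbalanced problems where plain accuracy can look deceptively high.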