Feature selection algorithm using information gain based clustering for supporting the treatment process of breast cancer

2016 International Conference on Informatics and Computing (ICIC) Pub Date : 1900-01-01 DOI:10.1109/IAC.2016.7905680

Tresna Maulana Fahrudin, I. Syarif, Ali Ridho Barakbah


Citations: 9

Abstract

Breast cancer is the second most common cancer affecting Indonesian women. The large number of breast cancer patients in Indonesia also affects their expectancy of recovering through routine treatment. Malignancy and probability of death are among the many determinants of a breast cancer patient's recovery. This research examines the determinant factors of breast cancer patient treatment based on the patient's latest condition. The dataset was obtained from an oncology hospital in East Java, Indonesia, and consists of 1907 samples, 18 attributes, and 2 classes. We used information gain as the feature selection technique, applying the entropy formula to select the attributes that contribute most to the data. We then used a clustering algorithm to determine how many attributes could be removed from the ranking produced by information gain. The clustering algorithm, Hierarchical K-means (a K-means optimization), categorized patients into two groups: normal and cancer. Our experiments show that the information gain method selected 12 of the 18 attributes as the highest-contributing factors for breast cancer patient treatment based on the latest condition. The clustering error ratio decreased from 44.48% (using the 18 original attributes) to 21.42% (using the 12 most important attributes).
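The abstract ranks attributes by information gain computed from the entropy formula. As a rough illustration only, the sketch below uses the textbook definition for discrete attributes, IG(A) = H(Y) - sum over values v of (|S_v| / |S|) * H(Y | A = v). It is not the authors' implementation: the attribute names (`stage`, `ward`), the toy rows, and the helper names are all hypothetical, and the hospital dataset is not reproduced here.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a sequence of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on values."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def rank_attributes(rows, labels):
    """Return (attribute, gain) pairs sorted from highest to lowest gain."""
    names = list(rows[0].keys())
    gains = {n: information_gain([r[n] for r in rows], labels) for n in names}
    return sorted(gains.items(), key=lambda item: item[1], reverse=True)


# Toy data with hypothetical attribute names, for illustration only.
rows = [
    {"stage": "early", "ward": "A"},
    {"stage": "early", "ward": "B"},
    {"stage": "late", "ward": "A"},
    {"stage": "late", "ward": "B"},
]
labels = ["normal", "normal", "cancer", "cancer"]

ranking = rank_attributes(rows, labels)
for name, gain in ranking:
    print(f"{name}: {gain:.3f}")
```

In this toy case `stage` separates the two classes perfectly and ranks first, while `ward` carries no information about the class and ranks last. In the setting the abstract describes, a ranking of this kind would be the input to the clustering step that decides how many low-ranked attributes to drop.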
