基于Tversky相似度的欠采样高斯核决策残桩Adaboost算法用于不平衡医疗数据分类

M. Kamaladevi, V. Venkatraman
{"title":"基于Tversky相似度的欠采样高斯核决策残桩Adaboost算法用于不平衡医疗数据分类","authors":"M. Kamaladevi, V. Venkatraman","doi":"10.15837/ijccc.2021.6.4291","DOIUrl":null,"url":null,"abstract":"In recent years, imbalanced data classification are utilized in several domains including, detecting fraudulent activities in banking sector, disease prediction in healthcare sector and so on. To solve the Imbalanced classification problem at data level, strategy such as undersampling or oversampling are widely used. Sampling technique pose a challenge of significant information loss. The proposed method involves two processes namely, undersampling and classification. First, undersampling is performed by means of Tversky Similarity Indexive Regression model. Here, regression along with the Tversky similarity index is used in analyzing the relationship between two instances from the dataset. Next, Gaussian Kernelized Decision stump AdaBoosting is used for classifying the instances into two classes. Here, the root node in the Decision Stump takes a decision on the basis of the Gaussian Kernel function, considering average of neighboring points accordingly the results is obtained at the leaf node. Weights are also adjusted to minimizing the training errors occurring during classification to find the best classifier. Experimental assessment is performed with two different imbalanced dataset (Pima Indian diabetes and Hepatitis dataset). Various performance metrics such as precision, recall, AUC under ROC score and F1-score are compared with the existing undersampling methods. Experimental results showed that prediction accuracy of minority class has improved and therefore minimizing false positive and false negative.","PeriodicalId":179619,"journal":{"name":"Int. J. Comput. Commun. Control","volume":"12 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Tversky Similarity based UnderSampling with Gaussian Kernelized Decision Stump Adaboost Algorithm for Imbalanced Medical Data Classification\",\"authors\":\"M. Kamaladevi, V. Venkatraman\",\"doi\":\"10.15837/ijccc.2021.6.4291\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In recent years, imbalanced data classification are utilized in several domains including, detecting fraudulent activities in banking sector, disease prediction in healthcare sector and so on. To solve the Imbalanced classification problem at data level, strategy such as undersampling or oversampling are widely used. Sampling technique pose a challenge of significant information loss. The proposed method involves two processes namely, undersampling and classification. First, undersampling is performed by means of Tversky Similarity Indexive Regression model. Here, regression along with the Tversky similarity index is used in analyzing the relationship between two instances from the dataset. Next, Gaussian Kernelized Decision stump AdaBoosting is used for classifying the instances into two classes. Here, the root node in the Decision Stump takes a decision on the basis of the Gaussian Kernel function, considering average of neighboring points accordingly the results is obtained at the leaf node. Weights are also adjusted to minimizing the training errors occurring during classification to find the best classifier. Experimental assessment is performed with two different imbalanced dataset (Pima Indian diabetes and Hepatitis dataset). Various performance metrics such as precision, recall, AUC under ROC score and F1-score are compared with the existing undersampling methods. Experimental results showed that prediction accuracy of minority class has improved and therefore minimizing false positive and false negative.\",\"PeriodicalId\":179619,\"journal\":{\"name\":\"Int. J. Comput. Commun. Control\",\"volume\":\"12 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-11-16\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Int. J. Comput. Commun. Control\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.15837/ijccc.2021.6.4291\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Int. J. Comput. Commun. Control","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.15837/ijccc.2021.6.4291","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

摘要

近年来,不平衡数据分类在银行业欺诈行为检测、医疗保健行业疾病预测等领域得到了广泛应用。为了解决数据层面的不平衡分类问题,欠采样或过采样等策略被广泛使用。采样技术带来了重大信息丢失的挑战。该方法包括欠采样和分类两个过程。首先,利用Tversky相似性指数回归模型进行欠采样。在这里,回归和Tversky相似指数被用于分析数据集中两个实例之间的关系。其次,使用高斯核决策残桩AdaBoosting将实例分为两类。在这里,Decision Stump中的根节点根据高斯核函数进行决策,并考虑相邻点的平均值,得到叶节点的结果。还调整了权重以最小化分类过程中出现的训练错误,以找到最佳分类器。实验评估是用两个不同的不平衡数据集(皮马印度糖尿病和肝炎数据集)进行的。将精度、召回率、ROC评分下的AUC和f1评分等性能指标与现有欠采样方法进行了比较。实验结果表明,少数类的预测精度得到了提高,从而最大限度地减少了假阳性和假阴性。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
Tversky Similarity based UnderSampling with Gaussian Kernelized Decision Stump Adaboost Algorithm for Imbalanced Medical Data Classification
In recent years, imbalanced data classification are utilized in several domains including, detecting fraudulent activities in banking sector, disease prediction in healthcare sector and so on. To solve the Imbalanced classification problem at data level, strategy such as undersampling or oversampling are widely used. Sampling technique pose a challenge of significant information loss. The proposed method involves two processes namely, undersampling and classification. First, undersampling is performed by means of Tversky Similarity Indexive Regression model. Here, regression along with the Tversky similarity index is used in analyzing the relationship between two instances from the dataset. Next, Gaussian Kernelized Decision stump AdaBoosting is used for classifying the instances into two classes. Here, the root node in the Decision Stump takes a decision on the basis of the Gaussian Kernel function, considering average of neighboring points accordingly the results is obtained at the leaf node. Weights are also adjusted to minimizing the training errors occurring during classification to find the best classifier. Experimental assessment is performed with two different imbalanced dataset (Pima Indian diabetes and Hepatitis dataset). Various performance metrics such as precision, recall, AUC under ROC score and F1-score are compared with the existing undersampling methods. Experimental results showed that prediction accuracy of minority class has improved and therefore minimizing false positive and false negative.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信