PEARSON CORRELATION COEFFICIENT K-NEAREST NEIGHBOR OUTLIER CLASSIFICATION ON REAL-TIME DATASETS

D. Rajakumari, S. Karthika
DOI: 10.21917/ijsc.2020.0290 (https://doi.org/10.21917/ijsc.2020.0290)
Journal: Programmable Device Circuits and Systems, volume 46 1
Publication type: Journal Article
Citations: 2

Abstract

Detecting and classifying data that do not conform to expected behavior (outliers) plays a major role in a wide variety of applications, such as military surveillance, intrusion detection in cyber security, and fraud detection in online transactions. Accurate detection of outliers in high-dimensional data remains a major issue, and outlier prediction and classification must balance high accuracy against low computational time. Large, diverse feature sets require a reduction mechanism before classification. To this end, this paper proposes Distance-based Outlier Classification (DOC). The proposed work uses the Pearson Correlation Coefficient (PCC) to measure the correlation between data instances, and minimum instance learning through PCC estimation reduces the dimensionality. The work is split into two phases, training and testing. During training, labeling the most frequent samples isolates them from the infrequent ones and effectively reduces the data size. The testing phase employs the k-Nearest Neighbor (k-NN) scheme to classify the frequent samples. The dimensionality and the value of k are inversely proportional to each other, so selecting a large value of k offers a significant reduction in dimensionality. The combination of PCC-based instance learning and a high value of k reduces the dimensionality and the noise, respectively. A comparative analysis of the proposed PCC-k-NN against conventional algorithms such as Decision Tree, Naive Bayes, Instance-Based K-means (IBK), and Triangular Boundary-based Classification (TBC) in terms of sensitivity, specificity, accuracy, precision, and recall demonstrates its effectiveness in outlier classification (OC). In addition, experimental validation of PCC-k-NN against state-of-the-art methods in terms of execution time confirms the trade-off between low time consumption and high accuracy.
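
The abstract does not spell out how PCC and k-NN are combined, so the following is only a minimal sketch under stated assumptions, not the authors' method: it treats `1 - PCC` between two instances as a distance and classifies a query by majority vote among its `k` most correlated training instances. The function names `pearson` and `knn_predict`, the zero-variance handling, and the toy data are all hypothetical and chosen for illustration.

```python
import numpy as np
from collections import Counter


def pearson(x, y):
    """Pearson correlation coefficient between two 1-D feature vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    denom = np.sqrt((xc ** 2).sum() * (yc ** 2).sum())
    # A constant vector has zero variance; returning 0 here is an assumption.
    return 0.0 if denom == 0 else float((xc * yc).sum() / denom)


def knn_predict(train_X, train_y, query, k=3):
    """Majority vote among the k training instances most correlated with query.

    Uses 1 - PCC as the distance (an illustrative choice, not from the paper).
    """
    dists = np.array([1.0 - pearson(row, query) for row in train_X])
    nearest = np.argsort(dists, kind="stable")[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy data, made up for illustration: rising profiles are labeled "normal",
# falling profiles are labeled "outlier".
train_X = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.5],
    [1.0, 1.5, 2.5, 3.0],
    [4.0, 3.0, 2.0, 1.0],
    [9.0, 6.0, 4.0, 1.0],
])
train_y = ["normal", "normal", "normal", "outlier", "outlier"]

print(knn_predict(train_X, train_y, [0.5, 1.0, 1.4, 2.1], k=3))
print(knn_predict(train_X, train_y, [5.0, 4.0, 2.5, 2.0], k=3))
```

On this toy data the rising query is voted "normal" and the falling query "outlier". A larger `k` averages the vote over more neighbors, which is the usual way k-NN dampens the effect of noisy labels; whether this matches the paper's use of large `k` cannot be confirmed from the abstract alone.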