Classes versus Communities: Outlier Detection and Removal in Tabular Datasets via Social Network Analysis (ClaCO)
Authors: Serkan Üçer, Tansel Özyer, R. Alhajj
Published in: 2022 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM)
Publication date: 2022-11-10
DOI: 10.1109/ASONAM55673.2022.10068694 (https://doi.org/10.1109/ASONAM55673.2022.10068694)
Abstract
In this research, we introduce a model for detecting inconsistent and anomalous samples in the labeled tabular datasets that are frequently used in machine learning classification tasks. Our model, ClaCO (Classes vs. Communities: SNA for Outlier Detection), first converts labeled tabular data into an attributed, labeled, undirected network graph. After enriching the graph, it analyzes the edge structure of each node's egonet in terms of class and community membership, using a new SNA metric, the Consistency Score of a Node (CSoN). Through an exhaustive analysis of a node's ego network, CSoN measures the consistency of the node by examining how similar its immediate neighbors are in terms of shared class and/or shared community membership. To evaluate the proposed ClaCO, we employed it as an auxiliary method for detecting anomalous samples in the training split of a traditional ML classification task. The nodes with the lowest CSoN scores are flagged as outliers and removed from the training dataset, and the remaining samples are fed into the ML model. The resulting classification performance is then compared with the performance obtained on the ‘whole’ dataset and on datasets reduced by competing outlier detection methods. The results show that this outlier detection model is effective: on several well-known real-life and synthetic datasets, it improves classification performance relative to both the whole dataset and the datasets reduced by competing outlier detection methods.
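The pipeline described in the abstract (tabular rows to graph, community detection, per-node consistency score, removal of the lowest-scoring nodes) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration, not the paper's method: the graph is built as a k-nearest-neighbour graph, communities come from a simple deterministic label propagation, and the score is a stand-in for CSoN that averages the share of ego-network neighbours with the same class and the share with the same community. The function names `knn_graph`, `label_propagation`, `consistency_scores`, and `flag_outliers` are hypothetical.

```python
import math
from collections import Counter

def knn_graph(rows, k):
    """Undirected k-nearest-neighbour graph as {node: set(neighbours)} (assumed construction)."""
    adj = {i: set() for i in range(len(rows))}
    for i, a in enumerate(rows):
        others = sorted((j for j in range(len(rows)) if j != i),
                        key=lambda j: math.dist(a, rows[j]))
        for j in others[:k]:
            adj[i].add(j)
            adj[j].add(i)
    return adj

def label_propagation(adj, max_rounds=20):
    """Deterministic label propagation: each node adopts its neighbours' most
    common label, ties broken by the smallest label (stand-in community detector)."""
    comm = {n: n for n in adj}
    for _ in range(max_rounds):
        changed = False
        for n in sorted(adj):
            if not adj[n]:
                continue
            counts = Counter(comm[m] for m in adj[n])
            top = max(counts.values())
            best = min(lbl for lbl, c in counts.items() if c == top)
            if best != comm[n]:
                comm[n], changed = best, True
        if not changed:
            break
    return comm

def consistency_scores(adj, classes, comm):
    """Illustrative stand-in for CSoN: average, over a node's neighbours, of
    class agreement and community agreement (each 0 or 1)."""
    scores = {}
    for n, nbrs in adj.items():
        if not nbrs:
            scores[n] = 0.0
            continue
        agree = sum((classes[m] == classes[n]) + (comm[m] == comm[n]) for m in nbrs)
        scores[n] = agree / (2 * len(nbrs))
    return scores

def flag_outliers(scores, n_remove):
    """Return the n_remove nodes with the lowest scores (ties broken by node id)."""
    return sorted(scores, key=lambda n: (scores[n], n))[:n_remove]

# Toy data: two well-separated clusters; row 3 sits in the first cluster
# but carries the second cluster's class label.
rows = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
classes = [0, 0, 0, 1, 1, 1, 1, 1]

adj = knn_graph(rows, k=3)
comm = label_propagation(adj)
scores = consistency_scores(adj, classes, comm)
outliers = flag_outliers(scores, n_remove=1)

# Training set with the flagged rows removed, ready to feed to a classifier.
clean_rows = [r for i, r in enumerate(rows) if i not in outliers]
```

In this toy setup the mislabeled row shares a community with its neighbours but not their class, so it receives the lowest score and is the one dropped. A real application would need to choose the graph construction, the community detection algorithm, and the removal threshold deliberately; the abstract does not specify them.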