Outlier mining in high-dimensional data using the Jensen–Shannon divergence and graph structure analysis

IF 2.6 Q1 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS

Journal of Physics Complexity | Pub Date: 2022-12-06 | DOI: 10.1088/2632-072X/aca94a

Alex S O Toledo, Riccardo Silini, L. Carpi, C. Masoller


Citations: 1

Abstract

Reliable anomaly/outlier detection algorithms have practical applications in many fields. For instance, anomaly detection makes it possible to filter and clean the data used to train machine learning algorithms, improving their performance. However, outlier mining is challenging when the data is high-dimensional, and different approaches have been proposed for different types of data (temporal, spatial, network, etc.). Here we propose a methodology to mine outliers in generic datasets in which it is possible to define a meaningful distance between elements of the dataset. The methodology is based on defining a fully connected, undirected graph, where the nodes are the elements of the dataset and the link weights are the distances between the nodes. Outlier scores are defined by analyzing the structure of the graph, in particular by using the Jensen–Shannon (JS) divergence to compare the distributions of weights of different nodes. We demonstrate the method using a publicly available database of credit-card transactions, where some of the transactions are labeled as fraudulent. We compare with the performance obtained when using Euclidean distances and graph percolation, and show that the JS divergence improves performance but increases the computational cost.
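The graph construction described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact score definition, so everything below is an assumption made for illustration: link weights are taken to be Euclidean distances, each node's weight distribution is estimated with a fixed-bin histogram, and a node's score is the JS divergence between its own distribution and the pooled distribution of all link weights. The names `js_divergence`, `outlier_scores`, and `n_bins` are hypothetical and do not come from the paper.

```python
import numpy as np


def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2, so bounded by 1) of two histograms."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)  # mixture distribution

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute 0 by convention
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def outlier_scores(X, n_bins=20):
    """Score each row of X by how unusual its distribution of link weights is.

    Assumption (not from the paper): the reference distribution is the pooled
    histogram of all link weights in the graph. Needs at least two distinct rows.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    # Fully connected, undirected graph: weight(i, j) = Euclidean distance.
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))
    off_diag = ~np.eye(n, dtype=bool)  # exclude self-links
    edges = np.linspace(0.0, D.max(), n_bins + 1)
    reference, _ = np.histogram(D[off_diag], bins=edges)
    scores = np.empty(n)
    for i in range(n):
        node_hist, _ = np.histogram(D[i, off_diag[i]], bins=edges)
        scores[i] = js_divergence(node_hist, reference)
    return scores


# Synthetic demo: 50 Gaussian points in 10 dimensions plus one distant point.
rng = np.random.default_rng(0)
inliers = rng.normal(0.0, 1.0, size=(50, 10))
planted = np.full((1, 10), 10.0)
X = np.vstack([inliers, planted])
scores = outlier_scores(X)
print(int(np.argmax(scores)), float(scores.max()))
```

In this toy setup the planted point (index 50) is expected to receive the largest score, because its distances to all other nodes fall in bins that the pooled distribution rarely occupies. The pairwise distance matrix costs O(n²) time and memory, which is consistent with the abstract's remark that the JS-based approach increases the computational cost. Any other meaningful distance could be substituted for the Euclidean one.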


