Outlier mining in high-dimensional data using the Jensen–Shannon divergence and graph structure analysis

Impact factor: 2.6; JCR Q1 (Mathematics, Interdisciplinary Applications)
Alex S O Toledo, Riccardo Silini, L. Carpi, C. Masoller
Journal of Physics Complexity. Published 2022-12-06. DOI: https://doi.org/10.1088/2632-072X/aca94a
Citations: 1

Abstract

Reliable anomaly/outlier detection algorithms have practical applications in many fields. For instance, anomaly detection makes it possible to filter and clean the data used to train machine learning algorithms, improving their performance. However, outlier mining is challenging when the data is high-dimensional, and different approaches have been proposed for different types of data (temporal, spatial, network, etc.). Here we propose a methodology to mine outliers in generic datasets in which it is possible to define a meaningful distance between elements of the dataset. The methodology is based on defining a fully connected, undirected graph, where the nodes are the elements of the dataset and the links have weights that are the distances between the nodes. Outlier scores are defined by analyzing the structure of the graph, in particular by using the Jensen–Shannon (JS) divergence to compare the distributions of weights of different nodes. We demonstrate the method using a publicly available database of credit-card transactions, where some of the transactions are labeled as fraudulent. We compare the results with the performance obtained when using Euclidean distances and graph percolation, and show that the JS divergence improves performance but increases the computational cost.
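The graph construction and JS-based scoring described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the exact scoring rule, so several choices here are assumptions made only for illustration: Euclidean distance as the link weight, a histogram with `n_bins` bins as the weight distribution of each node, and a score equal to the JS divergence between a node's weight distribution and the distribution pooled over all links. The names `js_divergence`, `outlier_scores` and `n_bins` are hypothetical, and the data are synthetic (a Gaussian cloud plus one planted distant point), not the credit-card dataset.

```python
import numpy as np


def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2) between two discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute 0 by convention
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def outlier_scores(X, n_bins=20):
    """Score each row of X by how much its link-weight distribution deviates.

    Assumption (not taken from the paper): the reference distribution is the
    histogram of all link weights in the graph.
    """
    # Link weights of the fully connected, undirected graph:
    # pairwise Euclidean distances between all elements.
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))

    n = len(X)
    off_diag = ~np.eye(n, dtype=bool)  # exclude self-links
    edges = np.linspace(0.0, D.max(), n_bins + 1)

    # Pooled distribution of weights over all links.
    pooled, _ = np.histogram(D[off_diag], bins=edges)

    scores = np.empty(n)
    for i in range(n):
        # Distribution of the weights attached to node i.
        hist_i, _ = np.histogram(D[i, off_diag[i]], bins=edges)
        scores[i] = js_divergence(hist_i, pooled)
    return scores


# Synthetic example: 200 Gaussian points in 5 dimensions plus one distant point.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 5)), [[8.0] * 5]])
scores = outlier_scores(X)
print("index of highest score:", int(np.argmax(scores)))
```

In this toy setup the planted point (the last row) should receive the highest score, because all of its links are long while the pooled distribution is dominated by short links. Building the full distance matrix costs O(n²) time and memory, which is in line with the abstract's remark that the JS-based approach increases the computational cost.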
Source journal
Journal of Physics Complexity (Computer Science: Information Systems)
CiteScore: 4.30
Self-citation rate: 11.10%
Articles published: 45
Review time: 14 weeks