基于概率密度聚类的离群点检测算法

IF 0.7 4区计算机科学 Q4 COMPUTER SCIENCE, SOFTWARE ENGINEERING

International Journal of Data Warehousing and Mining Pub Date : 2023-11-21 DOI:10.4018/ijdwm.333901

Wei Wang, Yongjian Ren, Renjie Zhou, Jilin Zhang

{"title":"基于概率密度聚类的离群点检测算法","authors":"Wei Wang, Yongjian Ren, Renjie Zhou, Jilin Zhang","doi":"10.4018/ijdwm.333901","DOIUrl":null,"url":null,"abstract":"Outlier detection for batch and streaming data is an important branch of data mining. However, there are shortcomings for existing algorithms. For batch data, the outlier detection algorithm, only labeling a few data points, is not accurate enough because it uses histogram strategy to generate feature vectors. For streaming data, the outlier detection algorithms are sensitive to data distance, resulting in low accuracy when sparse clusters and dense clusters are close to each other. Moreover, they require tuning of parameters, which takes a lot of time. With this, the manuscript per the authors propose a new outlier detection algorithm, called PDC which use probability density to generate feature vectors to train a lightweight machine learning model that is finally applied to detect outliers. PDC takes advantages of accuracy and insensitivity-to-data-distance of probability density, so it can overcome the aforementioned drawbacks.","PeriodicalId":54963,"journal":{"name":"International Journal of Data Warehousing and Mining","volume":"31 1","pages":""},"PeriodicalIF":0.7000,"publicationDate":"2023-11-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"An Outlier Detection Algorithm Based on Probability Density Clustering\",\"authors\":\"Wei Wang, Yongjian Ren, Renjie Zhou, Jilin Zhang\",\"doi\":\"10.4018/ijdwm.333901\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Outlier detection for batch and streaming data is an important branch of data mining. However, there are shortcomings for existing algorithms. For batch data, the outlier detection algorithm, only labeling a few data points, is not accurate enough because it uses histogram strategy to generate feature vectors. For streaming data, the outlier detection algorithms are sensitive to data distance, resulting in low accuracy when sparse clusters and dense clusters are close to each other. Moreover, they require tuning of parameters, which takes a lot of time. With this, the manuscript per the authors propose a new outlier detection algorithm, called PDC which use probability density to generate feature vectors to train a lightweight machine learning model that is finally applied to detect outliers. PDC takes advantages of accuracy and insensitivity-to-data-distance of probability density, so it can overcome the aforementioned drawbacks.\",\"PeriodicalId\":54963,\"journal\":{\"name\":\"International Journal of Data Warehousing and Mining\",\"volume\":\"31 1\",\"pages\":\"\"},\"PeriodicalIF\":0.7000,\"publicationDate\":\"2023-11-21\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"International Journal of Data Warehousing and Mining\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.4018/ijdwm.333901\",\"RegionNum\":4,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q4\",\"JCRName\":\"COMPUTER SCIENCE, SOFTWARE ENGINEERING\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Journal of Data Warehousing and Mining","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.4018/ijdwm.333901","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"COMPUTER SCIENCE, SOFTWARE ENGINEERING","Score":null,"Total":0}

引用次数: 0

摘要

批量数据和流数据的离群点检测是数据挖掘的一个重要分支。然而，现有算法也存在不足之处。对于批量数据，离群点检测算法只标记几个数据点，由于使用直方图策略生成特征向量，因此不够准确。对于流数据，离群点检测算法对数据距离很敏感，当稀疏聚类和密集聚类彼此靠近时，准确率会很低。此外，这些算法需要调整参数，耗费大量时间。有鉴于此，作者在手稿中提出了一种新的离群值检测算法，称为 PDC，它利用概率密度生成特征向量，训练轻量级机器学习模型，最后应用于检测离群值。PDC 利用了概率密度的准确性和对数据距离不敏感的优点，因此可以克服上述缺点。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

An Outlier Detection Algorithm Based on Probability Density Clustering

Outlier detection for batch and streaming data is an important branch of data mining. However, there are shortcomings for existing algorithms. For batch data, the outlier detection algorithm, only labeling a few data points, is not accurate enough because it uses histogram strategy to generate feature vectors. For streaming data, the outlier detection algorithms are sensitive to data distance, resulting in low accuracy when sparse clusters and dense clusters are close to each other. Moreover, they require tuning of parameters, which takes a lot of time. With this, the manuscript per the authors propose a new outlier detection algorithm, called PDC which use probability density to generate feature vectors to train a lightweight machine learning model that is finally applied to detect outliers. PDC takes advantages of accuracy and insensitivity-to-data-distance of probability density, so it can overcome the aforementioned drawbacks.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

International Journal of Data Warehousing and Mining COMPUTER SCIENCE, SOFTWARE ENGINEERING-

CiteScore

2.40

自引率

0.00%

发文量

审稿时长

>12 weeks

期刊介绍： The International Journal of Data Warehousing and Mining (IJDWM) disseminates the latest international research findings in the areas of data management and analyzation. IJDWM provides a forum for state-of-the-art developments and research, as well as current innovative activities focusing on the integration between the fields of data warehousing and data mining. Emphasizing applicability to real world problems, this journal meets the needs of both academic researchers and practicing IT professionals.The journal is devoted to the publications of high quality papers on theoretical developments and practical applications in data warehousing and data mining. Original research papers, state-of-the-art reviews, and technical notes are invited for publications. The journal accepts paper submission of any work relevant to data warehousing and data mining. Special attention will be given to papers focusing on mining of data from data warehouses; integration of databases, data warehousing, and data mining; and holistic approaches to mining and archiving