基于k近邻和Chebyshev不等式的多密度数据集聚类

IF 3.3 4区 计算机科学 Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS
Amira Bouchemal, Mohamed Tahar Kimour
{"title":"基于k近邻和Chebyshev不等式的多密度数据集聚类","authors":"Amira Bouchemal, Mohamed Tahar Kimour","doi":"10.31449/inf.v47i8.4719","DOIUrl":null,"url":null,"abstract":"Density-based clustering techniques are widely used in data mining on various fields. DBSCAN is one of the most popular density-based clustering algorithms, characterized by its ability to discover clusters with different shapes and sizes, and to separate noise and outliers. However, two fundamental limitations are still encountered that is the required input parameter of Eps distance threshold and its inefficiency to cluster datasets with various densities. For overcoming such drawbacks, a statistical based technique is proposed in this work. Specifically, the proposed technique utilizes an appropriate k-nearest neighbor density, based on which it sorts the dataset in ascending order and, using the statistical Chebyshev’s inequality as a suitable means for handling arbitrary distributions, it automatically determines different Eps values for clusters of various densities. Experiments conducted on synthetic and real datasets have demonstrated its efficiency and accuracy. The results indicate its superiority compared with DBSCAN, DPC, and their recently proposed improvements.","PeriodicalId":56292,"journal":{"name":"Informatica","volume":"96 1","pages":"0"},"PeriodicalIF":3.3000,"publicationDate":"2023-10-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Multi-Density Datasets Clustering Using K-Nearest Neighbors and Chebyshev’s Inequality\",\"authors\":\"Amira Bouchemal, Mohamed Tahar Kimour\",\"doi\":\"10.31449/inf.v47i8.4719\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Density-based clustering techniques are widely used in data mining on various fields. DBSCAN is one of the most popular density-based clustering algorithms, characterized by its ability to discover clusters with different shapes and sizes, and to separate noise and outliers. However, two fundamental limitations are still encountered that is the required input parameter of Eps distance threshold and its inefficiency to cluster datasets with various densities. For overcoming such drawbacks, a statistical based technique is proposed in this work. Specifically, the proposed technique utilizes an appropriate k-nearest neighbor density, based on which it sorts the dataset in ascending order and, using the statistical Chebyshev’s inequality as a suitable means for handling arbitrary distributions, it automatically determines different Eps values for clusters of various densities. Experiments conducted on synthetic and real datasets have demonstrated its efficiency and accuracy. The results indicate its superiority compared with DBSCAN, DPC, and their recently proposed improvements.\",\"PeriodicalId\":56292,\"journal\":{\"name\":\"Informatica\",\"volume\":\"96 1\",\"pages\":\"0\"},\"PeriodicalIF\":3.3000,\"publicationDate\":\"2023-10-06\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Informatica\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.31449/inf.v47i8.4719\",\"RegionNum\":4,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Informatica","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.31449/inf.v47i8.4719","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
引用次数: 0

摘要

基于密度的聚类技术广泛应用于各个领域的数据挖掘。DBSCAN是最流行的基于密度的聚类算法之一,其特点是能够发现不同形状和大小的聚类,并分离噪声和异常值。然而,Eps距离阈值的输入参数要求和对不同密度的数据集聚类效率不高,仍然存在两个基本的局限性。为了克服这些缺点,本文提出了一种基于统计的技术。具体来说,所提出的技术利用适当的k近邻密度,在此基础上按升序对数据集进行排序,并使用统计Chebyshev不等式作为处理任意分布的合适手段,它自动确定不同密度簇的不同Eps值。在合成数据集和真实数据集上进行的实验证明了该方法的有效性和准确性。结果表明,该方法与DBSCAN、DPC及其最近提出的改进方案相比具有优势。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
Multi-Density Datasets Clustering Using K-Nearest Neighbors and Chebyshev’s Inequality
Density-based clustering techniques are widely used in data mining on various fields. DBSCAN is one of the most popular density-based clustering algorithms, characterized by its ability to discover clusters with different shapes and sizes, and to separate noise and outliers. However, two fundamental limitations are still encountered that is the required input parameter of Eps distance threshold and its inefficiency to cluster datasets with various densities. For overcoming such drawbacks, a statistical based technique is proposed in this work. Specifically, the proposed technique utilizes an appropriate k-nearest neighbor density, based on which it sorts the dataset in ascending order and, using the statistical Chebyshev’s inequality as a suitable means for handling arbitrary distributions, it automatically determines different Eps values for clusters of various densities. Experiments conducted on synthetic and real datasets have demonstrated its efficiency and accuracy. The results indicate its superiority compared with DBSCAN, DPC, and their recently proposed improvements.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
Informatica
Informatica 工程技术-计算机:信息系统
CiteScore
5.90
自引率
6.90%
发文量
19
审稿时长
12 months
期刊介绍: The quarterly journal Informatica provides an international forum for high-quality original research and publishes papers on mathematical simulation and optimization, recognition and control, programming theory and systems, automation systems and elements. Informatica provides a multidisciplinary forum for scientists and engineers involved in research and design including experts who implement and manage information systems applications.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信