Multi-Density Datasets Clustering Using K-Nearest Neighbors and Chebyshev’s Inequality

IF 2.8 · Zone 4, Computer Science · Q2 Computer Science, Information Systems

Informatica Pub Date : 2023-10-06 DOI:10.31449/inf.v47i8.4719

Amira Bouchemal, Mohamed Tahar Kimour

Citations: 0

Abstract

Density-based clustering techniques are widely used in data mining across various fields. DBSCAN is one of the most popular density-based clustering algorithms, characterized by its ability to discover clusters of different shapes and sizes and to separate noise and outliers. However, it still has two fundamental limitations: it requires the Eps distance threshold as an input parameter, and it performs poorly when clustering datasets with varying densities. To overcome these drawbacks, this work proposes a statistics-based technique. Specifically, the proposed technique computes an appropriate k-nearest-neighbor density, sorts the dataset in ascending order of that density, and then uses Chebyshev’s inequality, which holds for arbitrary distributions, to automatically determine different Eps values for clusters of different densities. Experiments conducted on synthetic and real datasets demonstrate its efficiency and accuracy. The results indicate its superiority over DBSCAN, DPC, and their recently proposed improvements.
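
The abstract does not spell out the procedure, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes the density proxy is the distance to the k-th nearest neighbor, and that a new density level begins when a sorted k-distance exceeds the running mean of the current level by more than c standard deviations. Chebyshev’s inequality, P(|X - mu| >= c * sigma) <= 1/c^2, bounds how often that can happen by chance for any distribution. The function names `knn_distance` and `eps_levels` and the parameters `c` and `min_size` are hypothetical choices made for illustration.

```python
import math
import statistics


def knn_distance(points, k):
    """Return, for each point, the distance to its k-th nearest neighbor."""
    result = []
    for i, p in enumerate(points):
        # Brute-force distances to every other point, sorted ascending.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return result


def eps_levels(kdist, c=3.0, min_size=3):
    """Split sorted k-distances into density levels; return one Eps per level.

    Assumption (not taken from the paper): a value opens a new level when it
    exceeds mean + c * std of the current level. By Chebyshev's inequality,
    at most 1/c^2 of the values of any distribution lie that far from the mean.
    """
    ordered = sorted(kdist)
    levels = [[ordered[0]]]
    for d in ordered[1:]:
        level = levels[-1]
        # Only test the bound once the level has enough values for a
        # meaningful mean and standard deviation.
        if len(level) >= min_size:
            mu = statistics.fmean(level)
            sigma = statistics.pstdev(level)
            if d > mu + c * sigma:
                levels.append([d])
                continue
        level.append(d)
    # Use the largest k-distance in each level as that level's Eps candidate.
    return [max(level) for level in levels]
```

Under these assumptions, each returned value could serve as the Eps for one DBSCAN-style pass over the points belonging to that density level, processing the densest level first. The brute-force neighbor search is quadratic in the number of points and is meant only to keep the sketch self-contained.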

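Source journal

Informatica (Engineering & Technology, Computer Science: Information Systems)

CiteScore: 5.90

Self-citation rate: 6.90%

Number of publications:

Review time: 12 months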

Journal description: The quarterly journal Informatica provides an international forum for high-quality original research and publishes papers on mathematical simulation and optimization, recognition and control, programming theory and systems, automation systems and elements. Informatica provides a multidisciplinary forum for scientists and engineers involved in research and design including experts who implement and manage information systems applications.