ML-aVAT: A Novel 2-Stage Machine-Learning Approach for Automatic Clustering Tendency Assessment

IF 4.2 · Zone 3 (Computer Science) · JCR Q2, Computer Science, Artificial Intelligence

Big Data Research Pub Date : 2023-10-31 DOI:10.1016/j.bdr.2023.100413

Harshal Mittal, Jagarlamudi Sai Laxman, Dheeraj Kumar

Big Data Research, Volume 34, Article 100413. Journal article. https://www.sciencedirect.com/science/article/pii/S2214579623000461


Abstract

Clustering tendency assessment aims to determine whether a dataset contains any cluster structure and, if so, how many clusters it has; it is a critical problem in exploratory data analysis. The VAT (visual assessment of cluster tendency) family of algorithms provides a “visual” means of assessing the clustering tendency of a dataset. The VAT algorithm operates by reordering the pairwise distance matrix of the input data. When viewed as a monochrome image, this reordered dissimilarity matrix is called a reordered dissimilarity image (RDI), in which possible clusters appear as dark blocks along the diagonal. Interpreting an RDI, however, requires human intervention. Moreover, for datasets with complex cluster structure or noise, the dark blocks along the diagonal are not easily distinguishable, which makes them difficult to count accurately, and different individuals can report different numbers of dark blocks. Only a handful of approaches have been proposed in the literature to infer the cluster structure from a VAT-type RDI automatically (algorithmically), without human input. However, these approaches do not perform well for several data types and have impractically high run times. This paper proposes and develops ML-aVAT, a novel two-stage machine-learning-based approach for automatic clustering tendency assessment from VAT-type RDIs. Besides estimating the number of clusters, ML-aVAT can also infer the clustering hierarchy, i.e., sub-clusters within each group, which none of the previously proposed algorithms could do. Numerical experiments on various synthetic and real-life labeled and unlabeled datasets demonstrate the effectiveness of ML-aVAT in estimating clustering tendency and cluster hierarchy.
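The reordering step mentioned in the abstract can be illustrated with a minimal Python sketch. It follows the classic VAT ordering (a Prim-style sweep that repeatedly appends the unvisited object closest to the visited set); it is not ML-aVAT itself, whose two machine-learning stages the abstract does not describe. The function names `pairwise_distances`, `vat_order` and `reorder`, the Euclidean metric, and the toy points are assumptions made for illustration.

```python
import math


def pairwise_distances(points):
    """Return the full Euclidean distance matrix as a list of lists."""
    return [
        [math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q))) for q in points]
        for p in points
    ]


def vat_order(dist):
    """Return a VAT-style ordering for a square, symmetric distance matrix."""
    n = len(dist)
    # Start from an endpoint of the largest dissimilarity in the matrix.
    start = max(range(n), key=lambda i: max(dist[i]))
    order = [start]
    remaining = set(range(n)) - {start}
    # nearest[j]: distance from unvisited object j to its closest visited object.
    nearest = {j: dist[start][j] for j in remaining}
    while remaining:
        # Prim-style step: take the unvisited object closest to the visited set
        # (ties broken by the smaller index so the result is deterministic).
        nxt = min(remaining, key=lambda j: (nearest[j], j))
        order.append(nxt)
        remaining.remove(nxt)
        for j in remaining:
            nearest[j] = min(nearest[j], dist[nxt][j])
    return order


def reorder(dist, order):
    """Permute rows and columns of the distance matrix by the given ordering."""
    return [[dist[i][j] for j in order] for i in order]


# Toy data (illustrative): two well-separated groups, interleaved in the input.
points = [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0), (11, 10)]
dist = pairwise_distances(points)
order = vat_order(dist)
rdi = reorder(dist, order)
print(order)  # [0, 2, 4, 1, 3, 5]
```

Rendering `rdi` as a grayscale image (small distances dark) would show two dark 3×3 blocks along the diagonal, one per group. Counting such blocks by eye is the manual step that the paper sets out to automate.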


Source journal

Big Data Research (Computer Science: Computer Science Applications)

CiteScore: 8.40

Self-citation rate: 3.00%

Journal description: The journal aims to promote and communicate advances in big data research by providing a fast, high-quality forum for researchers, practitioners and policy makers from the many different communities working on, and with, this topic. The journal will accept papers on foundational aspects of dealing with big data, as well as papers on specific platforms and technologies used to deal with big data. To promote data science and interdisciplinary collaboration between fields, and to showcase the benefits of data-driven research, papers demonstrating applications of big data in domains as diverse as geoscience, the social web, finance, e-commerce, health care, environment and climate, physics and astronomy, chemistry, life sciences and drug discovery, digital libraries and scientific publications, and security and government will also be considered. Occasionally the journal may publish whitepapers on policies, standards and best practices.