Harshal Mittal, Jagarlamudi Sai Laxman, Dheeraj Kumar
ML-aVAT: A Novel 2-Stage Machine-Learning Approach for Automatic Clustering Tendency Assessment
Big Data Research, volume 34, Article 100413. Published 2023-10-31. DOI: 10.1016/j.bdr.2023.100413
Journal impact factor: 3.5. JCR: Q2 (Computer Science, Artificial Intelligence).
https://www.sciencedirect.com/science/article/pii/S2214579623000461
Citations: 0
Abstract
Clustering tendency assessment, which aims to determine whether a dataset contains any cluster structure and, if it does, how many clusters it has, is a critical problem in exploratory data analysis. The VAT family of algorithms provides a “visual” means of assessing the clustering tendency of a dataset. The VAT algorithm operates by reordering the pairwise distance matrix of the input data. When viewed as a monochrome image, this reordered dissimilarity matrix is called a reordered dissimilarity image (RDI), in which possible data clusters appear as dark blocks along the diagonal. Interpreting an RDI, however, requires human intervention. Moreover, for datasets with complex cluster structure or noise, the dark blocks along the diagonal of the RDI are not easily distinguishable, which makes them difficult to count accurately, and different individuals can report different numbers of dark blocks. Only a handful of approaches have been proposed in the literature to infer the cluster structure from a VAT-type RDI automatically (algorithmically), without human input. However, these approaches do not perform well for several data types and have impractically high run times. This paper proposes and develops ML-aVAT, a novel two-stage machine-learning-based approach for automatic clustering tendency assessment from VAT-type RDIs. Besides estimating the number of clusters, ML-aVAT can also infer the clustering hierarchy, i.e., sub-clusters within each group, something none of the previously proposed algorithms could do. Numerical experiments on various synthetic and real-life labeled and unlabeled datasets demonstrate the effectiveness of ML-aVAT in estimating clustering tendency and cluster hierarchy.
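To make the reordering step mentioned in the abstract concrete, the following is a minimal sketch of the VAT ordering as it is commonly described: start at one endpoint of the largest dissimilarity, then repeatedly append the unvisited object closest to any object already ordered. It illustrates only how an RDI is formed, not ML-aVAT itself, whose two stages are not detailed in the abstract. The function names `vat_order` and `vat_reorder` and the toy 1-D data are assumptions made for illustration.

```python
def vat_order(dist):
    """Return a VAT-style ordering of the indices of a square, symmetric
    dissimilarity matrix given as a list of lists (hypothetical helper)."""
    n = len(dist)
    # Start from one endpoint of the largest dissimilarity in the matrix.
    start = max(range(n), key=lambda i: max(dist[i]))
    order = [start]
    remaining = set(range(n)) - {start}
    # nearest[j]: smallest dissimilarity from j to any already-ordered object.
    nearest = list(dist[start])
    while remaining:
        # Append the unvisited object closest to the ordered set.
        nxt = min(remaining, key=lambda j: nearest[j])
        order.append(nxt)
        remaining.remove(nxt)
        nearest = [min(a, b) for a, b in zip(nearest, dist[nxt])]
    return order


def vat_reorder(dist):
    """Return (reordered matrix, ordering); the matrix is what an RDI displays."""
    order = vat_order(dist)
    return [[dist[i][j] for j in order] for i in order], order


# Toy data (an assumption): two well-separated 1-D groups, interleaved on purpose.
points = [0.0, 10.0, 0.1, 10.1, 0.2, 10.2]
dist = [[abs(p - q) for q in points] for p in points]
rdi, order = vat_reorder(dist)
print(order)
for row in rdi:
    print(" ".join(f"{v:5.1f}" for v in row))
```

On this toy input the ordering places the three small values next to each other and the three large values next to each other, so the printed matrix shows two blocks of small entries along the diagonal. Rendered as a grayscale image with small values dark, these are the "dark blocks" a human would count. This sketch runs in quadratic time and is meant for small inputs only.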
About the journal:
The journal aims to promote and communicate advances in big data research by providing a fast and high quality forum for researchers, practitioners and policy makers from the very many different communities working on, and with, this topic.
The journal will accept papers on foundational aspects in dealing with big data, as well as papers on specific Platforms and Technologies used to deal with big data. To promote Data Science and interdisciplinary collaboration between fields, and to showcase the benefits of data driven research, papers demonstrating applications of big data in domains as diverse as Geoscience, Social Web, Finance, e-Commerce, Health Care, Environment and Climate, Physics and Astronomy, Chemistry, life sciences and drug discovery, digital libraries and scientific publications, security and government will also be considered. Occasionally the journal may publish whitepapers on policies, standards and best practices.