A comparative user study of visualization techniques for cluster analysis of multidimensional data sets

IF 1.8 | Zone 4, Computer Science | JCR Q3, COMPUTER SCIENCE, SOFTWARE ENGINEERING

Information Visualization Pub Date : 2020-07-04 DOI:10.1177/1473871620922166

E. Ventocilla, M. Riveiro

Information Visualization, volume 19, pages 318-338. Journal Article.

Citations: 9

Abstract

This article presents an empirical user study that compares eight multidimensional projection techniques for supporting the estimation of the number of clusters, k, embedded in six multidimensional data sets. The selection of the techniques was based on their intended design, or use, for visually encoding data structures, that is, neighborhood relations between data points or groups of data points in a data set. Concretely, we study: the difference between the estimates of k as given by participants when using different multidimensional projections; the accuracy of user estimations with respect to the number of labels in the data sets; the perceived usability of each multidimensional projection; whether user estimates disagree with k values given by a set of cluster quality measures; and whether there is a difference between experienced and novice users in terms of estimates and perceived usability. The results show that: dendrograms (from Ward’s hierarchical clustering) are likely to lead to estimates of k that are different from those given with other multidimensional projections, while Star Coordinates and Radial Visualizations are likely to lead to similar estimates; t-Stochastic Neighbor Embedding is likely to lead to estimates which are closer to the number of labels in a data set; cluster quality measures are likely to produce estimates which are different from those given by users using Ward and t-Stochastic Neighbor Embedding; U-Matrices and reachability plots will likely have a low perceived usability; and there is no statistically significant difference between the answers of experienced and novice users. Moreover, as data dimensionality increases, cluster quality measures are likely to produce estimates which are different from those perceived by users using any of the assessed multidimensional projections. It is also apparent that the inherent complexity of a data set, as well as the capability of each visual technique to disclose such complexity, has an influence on the perceived usability.
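To make concrete what "k values given by a set of cluster quality measures" could look like, here is a minimal sketch. The abstract does not name the measures used in the study, so the silhouette coefficient is an assumption chosen for illustration, as are the function names (`ward_cuts`, `silhouette`, `estimate_k`, `make_blobs`) and the synthetic four-dimensional data. The sketch cuts a naive Ward hierarchy at several candidate values of k and reports the k with the highest mean silhouette; it is not the authors' procedure.

```python
import math
import random


def ward_cuts(points, ks):
    """Naive Ward agglomerative clustering (toy-sized data only).

    Returns {k: labels} for every k in ks reached while merging.
    """
    clusters = [[i] for i in range(len(points))]
    dim = len(points[0])

    def centroid(members):
        return [sum(points[i][d] for i in members) / len(members) for d in range(dim)]

    def ward_cost(a, b):
        # Increase in within-cluster sum of squares if a and b are merged.
        ca, cb = centroid(a), centroid(b)
        sq = sum((x - y) ** 2 for x, y in zip(ca, cb))
        return len(a) * len(b) / (len(a) + len(b)) * sq

    cuts = {}
    while True:
        if len(clusters) in ks:
            labels = [0] * len(points)
            for c, members in enumerate(clusters):
                for i in members:
                    labels[i] = c
            cuts[len(clusters)] = labels
        if len(clusters) <= min(ks):
            break
        # Merge the pair of clusters with the smallest Ward cost.
        _, i, j = min(
            (ward_cost(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return cuts


def silhouette(points, labels):
    """Mean silhouette coefficient (assumed quality measure, Euclidean distance)."""
    groups = {}
    for i, label in enumerate(labels):
        groups.setdefault(label, []).append(i)
    total = 0.0
    for i, label in enumerate(labels):
        own = groups[label]
        if len(own) == 1:
            continue  # a singleton contributes a silhouette of 0
        a = sum(math.dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        b = min(
            sum(math.dist(points[i], points[j]) for j in g) / len(g)
            for other, g in groups.items()
            if other != label
        )
        total += (b - a) / max(a, b)
    return total / len(points)


def estimate_k(points, ks=range(2, 7)):
    """Return (best k, {k: score}) over the candidate values in ks."""
    ks = list(ks)
    scores = {k: silhouette(points, labels) for k, labels in ward_cuts(points, ks).items()}
    return max(scores, key=scores.get), scores


def make_blobs(seed=0, per_blob=10, spread=0.5):
    """Hypothetical data: three well-separated Gaussian blobs in four dimensions."""
    rng = random.Random(seed)
    centers = [(0, 0, 0, 0), (10, 0, 0, 0), (0, 10, 0, 0)]
    return [[c + rng.gauss(0, spread) for c in center]
            for center in centers for _ in range(per_blob)]


if __name__ == "__main__":
    k, scores = estimate_k(make_blobs())
    print("estimated k:", k)
    for candidate, score in sorted(scores.items()):
        print(candidate, round(score, 3))
```

An automated estimate of this kind is what the study contrasts with the values of k that participants read off a projection; on real, higher-dimensional data the two need not agree, which is one of the reported findings.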


Source journal

Information Visualization (COMPUTER SCIENCE, SOFTWARE ENGINEERING)

CiteScore

5.40

Self-citation rate

0.00%

Articles published

Review time

>12 weeks

About the journal: Information Visualization is essential reading for researchers and practitioners of information visualization and is of interest to computer scientists and data analysts working on related specialisms. It is an international, peer-reviewed journal publishing articles on fundamental research and applications of information visualization. The journal acts as a dedicated forum for the theories, methodologies, techniques and evaluations of information visualization and its applications. The journal is a core vehicle for developing a generic research agenda for the field by identifying and developing the unique and significant aspects of information visualization. Emphasis is placed on interdisciplinary material and on the close connection between theory and practice. This journal is a member of the Committee on Publication Ethics (COPE).