Towards a comprehensive visualisation of structure in large scale data sets

IF 6.3 2区物理与天体物理 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Machine Learning Science and Technology Pub Date : 2024-09-09 DOI:10.1088/2632-2153/ad6fea

Joan Garriga and Frederic Bartumeus

{"title":"Towards a comprehensive visualisation of structure in large scale data sets","authors":"Joan Garriga and Frederic Bartumeus","doi":"10.1088/2632-2153/ad6fea","DOIUrl":null,"url":null,"abstract":"Dimensionality reduction methods are fundamental to the exploration and visualisation of large data sets. Basic requirements for unsupervised data exploration are flexibility and scalability. However, current methods have computational limitations that restrict our ability to explore data structures to the lower range of scales. We focus on t-SNE and propose a chunk-and-mix protocol that enables the parallel implementation of this algorithm, as well as a self-adaptive parametric scheme that facilitates its parametric configuration. As a proof of concept, we present the pt-SNE algorithm, a parallel version of Barnes-Hat-SNE (an implementation of t-SNE). In pt-SNE, a single free parameter for the size of the neighbourhood, namely the perplexity, modulates the visualisation of the data structure at different scales, from local to global. Thanks to parallelisation, the runtime of the algorithm remains almost independent of the perplexity, which extends the range of scales to be analysed. The pt-SNE converges to a good global embedding comparable to current solutions, although it adds little noise at the local scale. This noise illustrates an unavoidable trade-off between computational speed and accuracy. We expect the same approach to be applicable to faster embedding algorithms than Barnes-Hat-SNE, such as Fast-Fourier Interpolation-based t-SNE or Uniform Manifold Approximation and Projection, thus extending the state of the art and allowing a more comprehensive visualisation and analysis of data structures.","PeriodicalId":33757,"journal":{"name":"Machine Learning Science and Technology","volume":"30 1","pages":""},"PeriodicalIF":6.3000,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Machine Learning Science and Technology","FirstCategoryId":"101","ListUrlMain":"https://doi.org/10.1088/2632-2153/ad6fea","RegionNum":2,"RegionCategory":"物理与天体物理","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 0

Abstract

Dimensionality reduction methods are fundamental to the exploration and visualisation of large data sets. Basic requirements for unsupervised data exploration are flexibility and scalability. However, current methods have computational limitations that restrict our ability to explore data structures to the lower range of scales. We focus on t-SNE and propose a chunk-and-mix protocol that enables the parallel implementation of this algorithm, as well as a self-adaptive parametric scheme that facilitates its parametric configuration. As a proof of concept, we present the pt-SNE algorithm, a parallel version of Barnes-Hat-SNE (an implementation of t-SNE). In pt-SNE, a single free parameter for the size of the neighbourhood, namely the perplexity, modulates the visualisation of the data structure at different scales, from local to global. Thanks to parallelisation, the runtime of the algorithm remains almost independent of the perplexity, which extends the range of scales to be analysed. The pt-SNE converges to a good global embedding comparable to current solutions, although it adds little noise at the local scale. This noise illustrates an unavoidable trade-off between computational speed and accuracy. We expect the same approach to be applicable to faster embedding algorithms than Barnes-Hat-SNE, such as Fast-Fourier Interpolation-based t-SNE or Uniform Manifold Approximation and Projection, thus extending the state of the art and allowing a more comprehensive visualisation and analysis of data structures.

查看原文本刊更多论文

实现大规模数据集结构的全面可视化

降维方法是探索和可视化大型数据集的基础。无监督数据探索的基本要求是灵活性和可扩展性。然而，目前的方法存在计算上的局限性，将我们探索数据结构的能力限制在较低的尺度范围内。我们将重点放在 t-SNE 上，并提出了一种分块混合协议（chunk-and-mix protocol）来并行执行该算法，以及一种自适应参数方案（self-adaptive parametric scheme）来简化参数配置。作为概念验证，我们提出了pt-SNE 算法，它是 Barnes-Hat-SNE（t-SNE 的实现）的并行版本。在 pt-SNE 算法中，邻域大小的一个自由参数，即困惑度，可以调节从局部到全局等不同尺度的数据结构的可视化。得益于并行化，该算法的运行时间几乎不受复杂度的影响，从而扩大了可分析的尺度范围。尽管 pt-SNE 算法在局部尺度上几乎没有增加噪音，但它收敛到了与当前解决方案相当的良好全局嵌入。这种噪音说明了计算速度和精确度之间不可避免的权衡。我们希望同样的方法也能适用于比 Barnes-Hat-SNE 更快的嵌入算法，如基于快速傅立叶插值的 t-SNE 或均匀曲面逼近和投影，从而拓展技术领域，实现更全面的数据结构可视化和分析。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Machine Learning Science and Technology Computer Science-Artificial Intelligence

CiteScore

9.10

自引率

4.40%

发文量

审稿时长

5 weeks

期刊介绍： Machine Learning Science and Technology is a multidisciplinary open access journal that bridges the application of machine learning across the sciences with advances in machine learning methods and theory as motivated by physical insights. Specifically, articles must fall into one of the following categories: advance the state of machine learning-driven applications in the sciences or make conceptual, methodological or theoretical advances in machine learning with applications to, inspiration from, or motivated by scientific problems.