Towards a comprehensive visualisation of structure in large scale data sets

IF 6.3 2区 物理与天体物理 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Joan Garriga and Frederic Bartumeus
{"title":"Towards a comprehensive visualisation of structure in large scale data sets","authors":"Joan Garriga and Frederic Bartumeus","doi":"10.1088/2632-2153/ad6fea","DOIUrl":null,"url":null,"abstract":"Dimensionality reduction methods are fundamental to the exploration and visualisation of large data sets. Basic requirements for unsupervised data exploration are flexibility and scalability. However, current methods have computational limitations that restrict our ability to explore data structures to the lower range of scales. We focus on t-SNE and propose a chunk-and-mix protocol that enables the parallel implementation of this algorithm, as well as a self-adaptive parametric scheme that facilitates its parametric configuration. As a proof of concept, we present the pt-SNE algorithm, a parallel version of Barnes-Hat-SNE (an implementation of t-SNE). In pt-SNE, a single free parameter for the size of the neighbourhood, namely the perplexity, modulates the visualisation of the data structure at different scales, from local to global. Thanks to parallelisation, the runtime of the algorithm remains almost independent of the perplexity, which extends the range of scales to be analysed. The pt-SNE converges to a good global embedding comparable to current solutions, although it adds little noise at the local scale. This noise illustrates an unavoidable trade-off between computational speed and accuracy. We expect the same approach to be applicable to faster embedding algorithms than Barnes-Hat-SNE, such as Fast-Fourier Interpolation-based t-SNE or Uniform Manifold Approximation and Projection, thus extending the state of the art and allowing a more comprehensive visualisation and analysis of data structures.","PeriodicalId":33757,"journal":{"name":"Machine Learning Science and Technology","volume":null,"pages":null},"PeriodicalIF":6.3000,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Machine Learning Science and Technology","FirstCategoryId":"101","ListUrlMain":"https://doi.org/10.1088/2632-2153/ad6fea","RegionNum":2,"RegionCategory":"物理与天体物理","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0

Abstract

Dimensionality reduction methods are fundamental to the exploration and visualisation of large data sets. Basic requirements for unsupervised data exploration are flexibility and scalability. However, current methods have computational limitations that restrict our ability to explore data structures to the lower range of scales. We focus on t-SNE and propose a chunk-and-mix protocol that enables the parallel implementation of this algorithm, as well as a self-adaptive parametric scheme that facilitates its parametric configuration. As a proof of concept, we present the pt-SNE algorithm, a parallel version of Barnes-Hat-SNE (an implementation of t-SNE). In pt-SNE, a single free parameter for the size of the neighbourhood, namely the perplexity, modulates the visualisation of the data structure at different scales, from local to global. Thanks to parallelisation, the runtime of the algorithm remains almost independent of the perplexity, which extends the range of scales to be analysed. The pt-SNE converges to a good global embedding comparable to current solutions, although it adds little noise at the local scale. This noise illustrates an unavoidable trade-off between computational speed and accuracy. We expect the same approach to be applicable to faster embedding algorithms than Barnes-Hat-SNE, such as Fast-Fourier Interpolation-based t-SNE or Uniform Manifold Approximation and Projection, thus extending the state of the art and allowing a more comprehensive visualisation and analysis of data structures.
实现大规模数据集结构的全面可视化
降维方法是探索和可视化大型数据集的基础。无监督数据探索的基本要求是灵活性和可扩展性。然而,目前的方法存在计算上的局限性,将我们探索数据结构的能力限制在较低的尺度范围内。我们将重点放在 t-SNE 上,并提出了一种分块混合协议(chunk-and-mix protocol)来并行执行该算法,以及一种自适应参数方案(self-adaptive parametric scheme)来简化参数配置。作为概念验证,我们提出了pt-SNE 算法,它是 Barnes-Hat-SNE(t-SNE 的实现)的并行版本。在 pt-SNE 算法中,邻域大小的一个自由参数,即困惑度,可以调节从局部到全局等不同尺度的数据结构的可视化。得益于并行化,该算法的运行时间几乎不受复杂度的影响,从而扩大了可分析的尺度范围。尽管 pt-SNE 算法在局部尺度上几乎没有增加噪音,但它收敛到了与当前解决方案相当的良好全局嵌入。这种噪音说明了计算速度和精确度之间不可避免的权衡。我们希望同样的方法也能适用于比 Barnes-Hat-SNE 更快的嵌入算法,如基于快速傅立叶插值的 t-SNE 或均匀曲面逼近和投影,从而拓展技术领域,实现更全面的数据结构可视化和分析。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
Machine Learning Science and Technology
Machine Learning Science and Technology Computer Science-Artificial Intelligence
CiteScore
9.10
自引率
4.40%
发文量
86
审稿时长
5 weeks
期刊介绍: Machine Learning Science and Technology is a multidisciplinary open access journal that bridges the application of machine learning across the sciences with advances in machine learning methods and theory as motivated by physical insights. Specifically, articles must fall into one of the following categories: advance the state of machine learning-driven applications in the sciences or make conceptual, methodological or theoretical advances in machine learning with applications to, inspiration from, or motivated by scientific problems.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信