How to visualize high-dimensional data

IF 5.6 | Zone 2, Medicine | Q1 PHYSIOLOGY
Ralf Mrowka, Ralf Schmauder
Acta Physiologica, vol. 240, no. 10. Published 2024-08-19. Journal Article.
DOI: 10.1111/apha.14219
https://onlinelibrary.wiley.com/doi/10.1111/apha.14219

Abstract

Recently, after a lecture, a colleague asked about a fancy diagram whose axis labels were not clear to him, and the ensuing discussion raised a few interesting thoughts on the matter. Physiological knowledge is often taught in university seminars and textbooks with the help of diagrams. A very important first step when discussing a diagram is to clarify which physical or physiological variable is represented on which axis, and at what scale and in what unit. Typical classical low-dimensional diagrams in Acta Physiologica publications show, for example, blood pressure over time [1], infarct size as a percentage of left ventricular mass depending on genotype [2], or urine excretion in volume per time depending on diet [3]. Without knowledge of their axes, even these classical diagrams might as well be “just” pieces of fancy modern art.

We strongly believe that graphical representation of complex data, for example as diagrams, is essential for communicating them. However, for specific types of diagrams, understanding and interpreting the content is more complex and requires more explanation than for classical diagrams. Specifically, we refer to the graphical representation of high-dimensional data, which has in recent years played an increasing role in new understandings of physiological processes.

To visualize data, a reduction of dimensionality is often applied. A simple example is a black-and-white photograph of a colorful, moving three-dimensional object. The snapshot “eliminates” the time dimension, the optical projection onto a plane in the camera eliminates one dimension in space, and the gray values reduce the spectral information to a single intensity value on the photograph. Although the photograph does not represent the complete “dataset”, it gives us in most cases a good impression of the situation captured by the photographer.
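
The photograph analogy can be written out as a toy sketch. It assumes that each sample of the moving object is a 7-tuple (time, x, y, z, red, green, blue); the function name `take_photo` and the sample values are hypothetical and only serve the illustration.

```python
# Toy version of the photograph analogy: 7 dimensions in, 3 dimensions out.
# The tuple layout (t, x, y, z, r, g, b) is an assumption of this sketch.

def take_photo(samples, t_snapshot):
    """Reduce (t, x, y, z, r, g, b) samples to (x, y, intensity) pixels."""
    photo = []
    for t, x, y, z, r, g, b in samples:
        if t != t_snapshot:
            continue                      # snapshot: the time dimension is dropped
        intensity = (r + g + b) / 3       # gray value: color is reduced to one number
        photo.append((x, y, intensity))   # projection: the depth z is dropped
    return photo

samples = [
    (0, 1.0, 2.0, 5.0, 255, 0, 0),   # a red point
    (0, 1.5, 2.5, 9.0, 0, 0, 255),   # a blue point, further away
    (1, 3.0, 4.0, 5.0, 0, 255, 0),   # a green point at a later time
]
photo = take_photo(samples, t_snapshot=0)
print(photo)
```

In this sketch the red and the blue point end up with the same gray value, which shows in miniature what kind of information a reduction of dimensionality can discard.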

Times have changed.

To describe the “amount” of data obtained for a study in the 1960s, one physiologist, for example, referred to the length of paper filled with the plotted blood pressure curves he was analyzing for one particular study. Compared with the amount of data back then, we now face a completely new situation. With the development of technology, we have to handle huge amounts of data. For example, in recent studies using single-cell RNAseq data, scientists obtained thousands of expression values for single genes for each of thousands of single cells, at multiple experimental points and possibly for multiple interventions. Obviously, one cannot produce a meaningful simple classical plot with thousands of dimensions.

To make sense of such high-dimensional data, researchers can employ methods for dimensionality reduction. One classical method is principal component analysis (PCA). This linear method projects the data onto a new coordinate system whose axes (the principal components) are the directions of maximum variance. Without diving too deep into the mathematics, this is done by calculating the eigenvectors of the covariance matrix of the data.
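
As a minimal sketch of the idea behind PCA, the following plain Python code finds the first principal component of a small two-dimensional dataset. It is not how statistical libraries implement PCA: it assumes a tiny dataset, uses power iteration in place of a full eigendecomposition, and the function name and data values are made up for the example.

```python
import math

def first_principal_component(rows, iterations=100):
    """Return (direction, variance, scores) for the first principal component.

    Sketch only: power iteration finds the eigenvector of the covariance
    matrix with the largest eigenvalue. It assumes the data are not constant
    and the start vector is not orthogonal to that eigenvector.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    # Sample covariance matrix (d x d) of the centered data.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Variance of the data along v (the corresponding eigenvalue).
    variance = sum(v[i] * cov[i][j] * v[j] for i in range(d) for j in range(d))
    # Coordinates of every data point on the new axis.
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, variance, scores

# Hypothetical measurements scattered along the line y = 2x.
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1)]
direction, variance, scores = first_principal_component(data)
print("direction:", [round(x, 3) for x in direction])
print("variance along it:", round(variance, 3))
```

For these data the direction found points roughly along (1, 2), so a single new axis captures almost all of the variance and the two-dimensional points can be described by one score each.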

It turned out that PCA, with its linear transformation, is not sufficient in the case described above. New statistical approaches for nonlinear dimensionality reduction have therefore been developed. One is t-distributed stochastic neighbor embedding (t-SNE) [4]; a later method, a kind of improved version of it, is uniform manifold approximation and projection (UMAP) [5]. Both dimensionality reduction techniques are often used for visualizing high-dimensional data. Unlike typical plots with well-defined physical quantities such as blood pressure, voltage, or time on the axes, the axes of t-SNE and UMAP plots have a more abstract interpretation. The positions of the points on the plot reflect the (probability) relationships between the individual data points in the original high-dimensional space, so that clustered clouds can emerge. Points in such a cloud that are close together on the plot are considered similar or closely related in the high-dimensional space.

When comparing t-SNE and UMAP, we can differentiate between the so-called local structure and global structure of the points in the diagram. The t-SNE method preserves the local structure of the data; that is, points that are close in the diagram, for example within a cluster, represent points that are close in the high-dimensional space. However, the global structure is not accurately preserved by t-SNE. This is where UMAP comes into play: it aims to preserve both local and global data structure more effectively than t-SNE. UMAP might therefore be more suitable for analyzing data where global topology is relevant for interpretation, for example in high-dimensional lineage analysis of stem cell maturation.

To illustrate their performance, we generated three datasets and visualized them with all three methods (Figure 2). Statistical analysis can also be performed on t-SNE and UMAP representations [6]. UMAP is widely used for high-dimensional data in neurophysiology [7], immunology [8], cancer research [9, 10], and infectious diseases such as COVID-19 [11].
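
To make the “(probability) relationships” more concrete, here is a simplified sketch of the quantities that t-SNE compares. It assumes one fixed Gaussian width `sigma` for all points, whereas the published method tunes a width per point from a perplexity setting, and it only evaluates the cost of two hand-written layouts instead of optimizing a layout by gradient descent. All function names and data values are hypothetical.

```python
import math

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def high_dim_similarities(points, sigma=1.0):
    """Symmetrized neighbor probabilities from a Gaussian kernel.

    Simplification: a single fixed sigma instead of a per-point width.
    """
    n = len(points)
    cond = []
    for i in range(n):
        w = [0.0 if i == j else
             math.exp(-sq_dist(points[i], points[j]) / (2 * sigma ** 2))
             for j in range(n)]
        total = sum(w)
        cond.append([x / total for x in w])          # p(j | i)
    return [[(cond[i][j] + cond[j][i]) / (2 * n) for j in range(n)]
            for i in range(n)]

def low_dim_similarities(layout):
    """Neighbor probabilities in the plot from a heavy-tailed Student-t kernel."""
    n = len(layout)
    w = [[0.0 if i == j else 1.0 / (1.0 + sq_dist(layout[i], layout[j]))
          for j in range(n)] for i in range(n)]
    total = sum(map(sum, w))
    return [[x / total for x in row] for row in w]

def kl_divergence(p, q):
    """Mismatch between the two sets of probabilities (the t-SNE cost)."""
    return sum(pij * math.log(pij / qij)
               for prow, qrow in zip(p, q)
               for pij, qij in zip(prow, qrow) if pij > 0)

# Two tight pairs of points in three dimensions.
points = [(0, 0, 0), (0.5, 0, 0), (5, 5, 5), (5.5, 5, 5)]
P = high_dim_similarities(points)

faithful_layout = [(0, 0), (0.5, 0), (5, 0), (5.5, 0)]   # pairs stay together
scrambled_layout = [(0, 0), (5, 0), (0.5, 0), (5.5, 0)]  # pairs are torn apart

cost_faithful = kl_divergence(P, low_dim_similarities(faithful_layout))
cost_scrambled = kl_divergence(P, low_dim_similarities(scrambled_layout))
print(cost_faithful < cost_scrambled)
```

The layout that keeps neighbors together has the lower cost. The axes of the layout never enter the calculation as physical quantities; only the distances between points do, which is why the axes of such plots carry no unit.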

Taken together, both nonlinear methods have been applied to high-dimensional data and can reveal clusters in the data. Their usage in the scientific literature has grown steadily since the methods were published (Figure 1). The meaning of the axes of t-SNE and UMAP plots is not as straightforward as in the more traditional diagrams used in physiology: the spread of the points, for example, shows the variance of what? The mapping of the data points into the diagram is a nonlinear transformation that aims to preserve neighborhood relationships on local scales. UMAP might be better at preserving global structure; however, because of the nonlinear transformations, one should be cautious when interpreting the distances between clusters for both t-SNE and UMAP. There are efforts to further improve these methods [12]. We will see whether they find resonance and application in the scientific community.
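
The caution about cluster distances can be illustrated with a small check of how well a two-dimensional layout keeps local neighborhoods. The sketch below is not taken from the t-SNE or UMAP literature; the score, the function names, and the hand-written layout (standing in for the output of a nonlinear method) are assumptions made for this example.

```python
import math

def nearest_neighbors(points, k):
    """For every point, the set of indices of its k nearest neighbors."""
    result = []
    for i, p in enumerate(points):
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: math.dist(p, points[j]))
        result.append(set(others[:k]))
    return result

def neighbor_preservation(original, layout, k):
    """Mean fraction of each point's k nearest neighbors kept in the layout."""
    high = nearest_neighbors(original, k)
    low = nearest_neighbors(layout, k)
    kept = sum(len(h & l) for h, l in zip(high, low))
    return kept / (k * len(original))

# Two clusters of three points each in three dimensions.
original = [(0, 0, 0), (1, 0, 0), (0, 1.2, 0),
            (10, 10, 10), (11, 10, 10), (10, 11.2, 10)]
# Hypothetical 2D layout: neighborhoods are intact, but the second
# cluster has been drawn much closer to the first one.
layout = [(0, 0), (1, 0), (0, 1.2), (3, 0), (4, 0), (3, 1.2)]

score = neighbor_preservation(original, layout, k=2)
gap_original = math.dist(original[0], original[3])
gap_layout = math.dist(layout[0], layout[3])
print(score, round(gap_original, 1), round(gap_layout, 1))
```

Every point keeps its two nearest neighbors, so the local structure is fully preserved, yet the distance between the two clusters in the layout is several times smaller than in the original space. A reader who measured the gap between clusters on the plot would therefore draw the wrong conclusion about their separation.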

To come back to the beginning: please remember to take a second to explain what your high-dimensional plots show in your presentation. As physiology is a rather interdisciplinary field and methods develop rapidly, part of your audience might otherwise just be admiring beautiful graphics and not your data.

RM did the statistics on PubMed occurrences. RM and RS did the data analysis for Figure 2. RM wrote the initial text. Both authors edited the final version of the manuscript.

None.
