Danny Salem, Anuradha Surendra, Graeme SV McDowell, Miroslava Cuperlovic-Culf
DOI: 10.1101/2024.09.04.611273 (https://doi.org/10.1101/2024.09.04.611273)
bioRxiv - Bioinformatics, published 2024-09-08
Projection STatistics (ProST): Online statistical assessment of group separation in data projection analysis
Motivation: Unsupervised data projection is a major step in data mining, whether it is used to determine trends in the data, to visualize multidimensional data in a reduced-dimension space, or to reduce the feature space by combining data. Methods such as Principal Component Analysis (PCA) or t-distributed Stochastic Neighbor Embedding (t-SNE) are regularly used as one of the first steps in computational biology or omics investigations. However, the significance of the separation of sample groups by these methods is generally judged by visual assessment. User-friendly applications for different projection methods, each focusing on distinct data properties, are needed, as is a rigorous method for statistically determining the significance of the separation of groups of interest in each dataset.
Results: We present Projection STatistics (ProST), a user-friendly solution for data projection analysis that provides three unsupervised approaches (PCA, t-SNE and UMAP) and one supervised approach (LDA). For each method we include a novel statistical investigation of the significance of group separation, using Mann-Whitney U test or t-test analysis, as well as the necessary preprocessing steps. ProST provides an unbiased, objective determination of the significance of the separation of measurement groups through either linear or manifold projection analysis, with methods ranging from those that separate points based on major variances to those that focus on point proximity based on distance.
Availability: The ProST software application is freely available at https://complimet.ca/shiny/ProST/ with source code provided on https://github.com/complimet/prost.
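
The idea in the Results paragraph, testing whether two groups separate along a projection axis rather than judging the plot by eye, can be illustrated with a small sketch. This is not ProST's implementation: the function names `first_pc_scores` and `mann_whitney_u`, the synthetic data, the use of the first principal component only, and the normal approximation without tie correction are all assumptions made for illustration. In practice one would more likely use `sklearn.decomposition.PCA` and `scipy.stats.mannwhitneyu`; the version below uses only the Python standard library so that it runs anywhere.

```python
import math
import random


def first_pc_scores(rows, iters=200):
    """Project samples onto the first principal component (power iteration)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the features (d x d).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Power iteration converges to the leading eigenvector of cov,
    # provided the start vector is not orthogonal to it.
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return [sum(c[j] * v[j] for j in range(d)) for c in centered]


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation, no tie correction)."""
    pooled = sorted([(val, 0) for val in x] + [(val, 1) for val in y])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # tied values share their average rank
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    n1, n2 = len(x), len(y)
    rank_sum_x = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)
    u = rank_sum_x - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mu) / sigma
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return u, p


# Synthetic example: two groups of 20 samples with 5 features each;
# group B is shifted by 3 units in every feature.
rng = random.Random(0)
group_a = [[rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(20)]
group_b = [[rng.gauss(3.0, 1.0) for _ in range(5)] for _ in range(20)]

scores = first_pc_scores(group_a + group_b)
u_stat, p_value = mann_whitney_u(scores[:20], scores[20:])
print(f"U = {u_stat:.1f}, two-sided p = {p_value:.3g}")
```

A small p-value here indicates that the two groups occupy different ranges of the first component. The same pattern would apply to any other projection axis, and a t-test could be substituted when the scores are approximately normally distributed.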