利用RKS提高ISOMAP效率:与蛋白质序列上t分布随机邻居嵌入的比较研究

J Pub Date : 2023-10-31 DOI:10.3390/j6040038

Sarwan Ali, Murray Patterson

{"title":"利用RKS提高ISOMAP效率:与蛋白质序列上t分布随机邻居嵌入的比较研究","authors":"Sarwan Ali, Murray Patterson","doi":"10.3390/j6040038","DOIUrl":null,"url":null,"abstract":"Data visualization plays a crucial role in gaining insights from high-dimensional datasets. ISOMAP is a popular algorithm that maps high-dimensional data into a lower-dimensional space while preserving the underlying geometric structure. However, ISOMAP can be computationally expensive, especially for large datasets, due to the computation of the pairwise distances between data points. The motivation behind this study is to improve efficiency by leveraging an approximate method, which is based on random kitchen sinks (RKS). This approach provides a faster way to compute the kernel matrix. Using RKS significantly reduces the computational complexity of ISOMAP while still obtaining a meaningful low-dimensional representation of the data. We compare the performance of the approximate ISOMAP approach using RKS with the traditional t-SNE algorithm. The comparison involves computing the distance matrix using the original high-dimensional data and the low-dimensional data computed from both t-SNE and ISOMAP. The quality of the low-dimensional embeddings is measured using several metrics, including mean squared error (MSE), mean absolute error (MAE), and explained variance score (EVS). Additionally, the runtime of each algorithm is recorded to assess its computational efficiency. The comparison is conducted on a set of protein sequences, used in many bioinformatics tasks. We use three different embedding methods based on k-mers, minimizers, and position weight matrix (PWM) to capture various aspects of the underlying structure and the relationships between the protein sequences. By comparing different embeddings and by evaluating the effectiveness of the approximate ISOMAP approach using RKS and comparing it against t-SNE, we provide insights on the efficacy of our proposed approach. Our goal is to retain the quality of the low-dimensional embeddings while improving the computational performance.","PeriodicalId":73524,"journal":{"name":"J","volume":"64 7","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Improving ISOMAP Efficiency with RKS: A Comparative Study with t-Distributed Stochastic Neighbor Embedding on Protein Sequences\",\"authors\":\"Sarwan Ali, Murray Patterson\",\"doi\":\"10.3390/j6040038\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Data visualization plays a crucial role in gaining insights from high-dimensional datasets. ISOMAP is a popular algorithm that maps high-dimensional data into a lower-dimensional space while preserving the underlying geometric structure. However, ISOMAP can be computationally expensive, especially for large datasets, due to the computation of the pairwise distances between data points. The motivation behind this study is to improve efficiency by leveraging an approximate method, which is based on random kitchen sinks (RKS). This approach provides a faster way to compute the kernel matrix. Using RKS significantly reduces the computational complexity of ISOMAP while still obtaining a meaningful low-dimensional representation of the data. We compare the performance of the approximate ISOMAP approach using RKS with the traditional t-SNE algorithm. The comparison involves computing the distance matrix using the original high-dimensional data and the low-dimensional data computed from both t-SNE and ISOMAP. The quality of the low-dimensional embeddings is measured using several metrics, including mean squared error (MSE), mean absolute error (MAE), and explained variance score (EVS). Additionally, the runtime of each algorithm is recorded to assess its computational efficiency. The comparison is conducted on a set of protein sequences, used in many bioinformatics tasks. We use three different embedding methods based on k-mers, minimizers, and position weight matrix (PWM) to capture various aspects of the underlying structure and the relationships between the protein sequences. By comparing different embeddings and by evaluating the effectiveness of the approximate ISOMAP approach using RKS and comparing it against t-SNE, we provide insights on the efficacy of our proposed approach. Our goal is to retain the quality of the low-dimensional embeddings while improving the computational performance.\",\"PeriodicalId\":73524,\"journal\":{\"name\":\"J\",\"volume\":\"64 7\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-10-31\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"J\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.3390/j6040038\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"J","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.3390/j6040038","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

摘要

数据可视化在从高维数据集获得洞察力方面起着至关重要的作用。ISOMAP是一种流行的算法，它将高维数据映射到低维空间，同时保留底层几何结构。然而，由于计算数据点之间的成对距离，ISOMAP在计算上可能会很昂贵，特别是对于大型数据集。这项研究背后的动机是利用一种基于随机厨房水槽(RKS)的近似方法来提高效率。这种方法提供了一种更快的计算核矩阵的方法。使用RKS可以显著降低ISOMAP的计算复杂度，同时仍然获得有意义的低维数据表示。我们比较了使用RKS的近似ISOMAP方法与传统的t-SNE算法的性能。比较包括使用原始高维数据和从t-SNE和ISOMAP计算的低维数据计算距离矩阵。使用均方误差(MSE)、平均绝对误差(MAE)和解释方差评分(EVS)等指标来衡量低维嵌入的质量。此外，还记录了每个算法的运行时间，以评估其计算效率。比较是在一组蛋白质序列上进行的，用于许多生物信息学任务。我们使用基于k-mers、最小化器和位置权重矩阵(PWM)的三种不同的嵌入方法来捕获底层结构的各个方面以及蛋白质序列之间的关系。通过比较不同的嵌入，通过评估使用RKS的近似ISOMAP方法的有效性，并将其与t-SNE进行比较，我们提供了关于我们提出的方法有效性的见解。我们的目标是在提高计算性能的同时保持低维嵌入的质量。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Improving ISOMAP Efficiency with RKS: A Comparative Study with t-Distributed Stochastic Neighbor Embedding on Protein Sequences

Data visualization plays a crucial role in gaining insights from high-dimensional datasets. ISOMAP is a popular algorithm that maps high-dimensional data into a lower-dimensional space while preserving the underlying geometric structure. However, ISOMAP can be computationally expensive, especially for large datasets, due to the computation of the pairwise distances between data points. The motivation behind this study is to improve efficiency by leveraging an approximate method, which is based on random kitchen sinks (RKS). This approach provides a faster way to compute the kernel matrix. Using RKS significantly reduces the computational complexity of ISOMAP while still obtaining a meaningful low-dimensional representation of the data. We compare the performance of the approximate ISOMAP approach using RKS with the traditional t-SNE algorithm. The comparison involves computing the distance matrix using the original high-dimensional data and the low-dimensional data computed from both t-SNE and ISOMAP. The quality of the low-dimensional embeddings is measured using several metrics, including mean squared error (MSE), mean absolute error (MAE), and explained variance score (EVS). Additionally, the runtime of each algorithm is recorded to assess its computational efficiency. The comparison is conducted on a set of protein sequences, used in many bioinformatics tasks. We use three different embedding methods based on k-mers, minimizers, and position weight matrix (PWM) to capture various aspects of the underlying structure and the relationships between the protein sequences. By comparing different embeddings and by evaluating the effectiveness of the approximate ISOMAP approach using RKS and comparing it against t-SNE, we provide insights on the efficacy of our proposed approach. Our goal is to retain the quality of the low-dimensional embeddings while improving the computational performance.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

自引率

0.00%

发文量

审稿时长

10 weeks