Approximate nearest neighbor graph provides fast and efficient embedding with applications for large-scale biological data.

Impact factor: 4.0; JCR quartile: Q1 (Genetics & Heredity)

NAR Genomics and Bioinformatics Pub Date : 2024-12-18 eCollection Date: 2024-12-01 DOI:10.1093/nargab/lqae172

Jianshu Zhao, Jean Pierre Both, Konstantinos T Konstantinidis

NAR Genomics and Bioinformatics, volume 6, issue 4, article lqae172. Publication type: Journal Article. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11655291/pdf/


Abstract

Dimension reduction (DR or embedding) algorithms such as t-SNE and UMAP have many applications in big data visualization but remain slow for large datasets. Here, we further improve UMAP-like algorithms by (i) combining several aspects of t-SNE and UMAP to create a new DR algorithm; (ii) replacing their rate-limiting step, the K-nearest neighbor graph (K-NNG), with a Hierarchical Navigable Small World (HNSW) graph; and (iii) extending the functionality to DNA/RNA sequence data by combining HNSW with locality-sensitive hashing algorithms (e.g. MinHash) for distance estimation among sequences. We also provide additional features, including computation of local intrinsic dimension and hubness, which can reflect structures and properties of the underlying data that strongly affect K-NNG accuracy and thus the quality of the resulting embeddings. Our library, called annembed, is implemented and fully parallelized in Rust and shows competitive accuracy compared with popular UMAP-like algorithms. Additionally, we showcase the usefulness and scalability of our library with three real-world examples: visualizing a large-scale microbial genomic database, visualizing single-cell RNA sequencing data, and metagenomic contig (or population) binning. Therefore, annembed can facilitate DR for several tasks in biological data analysis where distance computation is expensive or when there are millions to billions of data points to process.
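The abstract mentions MinHash as a way to estimate distances among sequences before they are indexed in an HNSW graph. The following is a minimal sketch of that general idea, not annembed's API or the authors' implementation: it builds a MinHash signature over the distinct k-mers of a sequence and estimates the Jaccard distance as the fraction of signature slots that disagree. The function names (`seeded_hash`, `minhash_signature`, `jaccard_distance`), the use of the standard library's `DefaultHasher` as a seeded hash family, and the parameter values (k = 5, 128 hashes) are all assumptions chosen for illustration.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Hash one k-mer under a given seed.
/// Assumption: seeding DefaultHasher stands in for a family of hash functions.
fn seeded_hash(kmer: &[u8], seed: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    seed.hash(&mut hasher);
    kmer.hash(&mut hasher);
    hasher.finish()
}

/// MinHash signature: for each seed, the minimum hash over all distinct k-mers.
/// `k` must be at least 1; a sequence shorter than `k` yields a signature of u64::MAX.
fn minhash_signature(seq: &[u8], k: usize, num_hashes: usize) -> Vec<u64> {
    let kmers: HashSet<&[u8]> = seq.windows(k).collect();
    (0..num_hashes as u64)
        .map(|seed| {
            kmers
                .iter()
                .map(|kmer| seeded_hash(kmer, seed))
                .min()
                .unwrap_or(u64::MAX)
        })
        .collect()
}

/// Estimated Jaccard distance: 1 minus the fraction of matching signature slots.
fn jaccard_distance(a: &[u64], b: &[u64]) -> f64 {
    assert_eq!(a.len(), b.len(), "signatures must have equal length");
    let matches = a.iter().zip(b).filter(|(x, y)| x == y).count();
    1.0 - matches as f64 / a.len() as f64
}

fn main() {
    // Toy sequences (illustrative only): s2 differs from s1 in its last base,
    // s3 shares no 5-mers with s1.
    let s1 = b"ACGTACGTTGCAACGTTAGCCGATAGCTAGCTAGGCTA";
    let s2 = b"ACGTACGTTGCAACGTTAGCCGATAGCTAGCTAGGCTT";
    let s3 = b"TTTTGGGGCCCCAAAATTTTGGGGCCCCAAAATTTTGG";

    let (k, num_hashes) = (5, 128);
    let sig1 = minhash_signature(s1, k, num_hashes);
    let sig2 = minhash_signature(s2, k, num_hashes);
    let sig3 = minhash_signature(s3, k, num_hashes);

    println!("d(s1, s2) = {:.3}", jaccard_distance(&sig1, &sig2));
    println!("d(s1, s3) = {:.3}", jaccard_distance(&sig1, &sig3));
}
```

In a pipeline of the kind the abstract describes, a distance function like `jaccard_distance` would be handed to the approximate nearest neighbor index in place of an exact sequence comparison; the signature length trades memory and time against the variance of the estimate.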


