Approximate nearest neighbor graph provides fast and efficient embedding with applications for large-scale biological data.

Impact factor: 4.0 · JCR Q1 (Genetics & Heredity)
NAR Genomics and Bioinformatics · Publication date: 2024-12-18 · eCollection date: 2024-12-01 · DOI: 10.1093/nargab/lqae172
Jianshu Zhao, Jean Pierre Both, Konstantinos T Konstantinidis
NAR Genomics and Bioinformatics, volume 6, issue 4, article lqae172. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11655291/pdf/

Abstract

Dimension reduction (DR, or embedding) algorithms such as t-SNE and UMAP have many applications in big data visualization but remain slow for large datasets. Here, we further improve UMAP-like algorithms by (i) combining several aspects of t-SNE and UMAP to create a new DR algorithm; (ii) replacing their rate-limiting step, the K-nearest neighbor graph (K-NNG), with a Hierarchical Navigable Small World (HNSW) graph; and (iii) extending the functionality to DNA/RNA sequence data by combining HNSW with locality-sensitive hashing algorithms (e.g. MinHash) to estimate distances among sequences. We also provide additional features, including computation of local intrinsic dimension and hubness, which can reflect structures and properties of the underlying data that strongly affect K-NNG accuracy and thus the quality of the resulting embeddings. Our library, called annembed, is implemented and fully parallelized in Rust and shows competitive accuracy compared with popular UMAP-like algorithms. Additionally, we showcase the usefulness and scalability of our library with three real-world examples: visualizing a large-scale microbial genomic database, visualizing single-cell RNA sequencing data, and metagenomic contig (or population) binning. annembed can therefore facilitate DR for biological data analysis tasks where distance computation is expensive or where there are millions to billions of data points to process.
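
To make the sequence-distance and hubness ideas in the abstract concrete, here is a minimal sketch in plain Python. It is not annembed's API or implementation: every function name is hypothetical, the k-mer length and number of hash functions are arbitrary choices, and a brute-force neighbor search stands in for the HNSW index that the library uses. It shows how MinHash signatures give an estimate of Jaccard distance between k-mer sets, and how hubness can be read off a K-nearest neighbor graph as the number of times each point appears in other points' neighbor lists.

```python
import hashlib
from collections import Counter


def kmers(seq, k=4):
    """Return the set of k-mers (length-k substrings) of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def minhash_signature(items, num_hashes=64):
    """Keep the minimum hash value of the set under each salted hash function."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a different salt per hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(x.encode(), digest_size=8, salt=salt).digest(),
                "little",
            )
            for x in items
        ))
    return signature


def minhash_distance(sig_a, sig_b):
    """Estimate Jaccard distance as 1 - fraction of matching signature slots."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return 1.0 - matches / len(sig_a)


def knn_graph(sigs, k):
    """Exact k-NN lists by brute force (assumption: stands in for an HNSW search)."""
    graph = []
    for i, s in enumerate(sigs):
        ranked = sorted(
            (minhash_distance(s, t), j) for j, t in enumerate(sigs) if j != i
        )
        graph.append([j for _, j in ranked[:k]])
    return graph


def hubness_counts(graph):
    """k-occurrence: how often each point appears in other points' neighbor lists."""
    counts = Counter(j for neighbors in graph for j in neighbors)
    return [counts.get(i, 0) for i in range(len(graph))]


# Toy sequences (made up for illustration): two similar pairs.
seqs = ["ACGTACGTACGT", "ACGTACGTACGA", "TTTTGGGGCCCC", "TTTTGGGGCCCA"]
sigs = [minhash_signature(kmers(s)) for s in seqs]
graph = knn_graph(sigs, k=1)
hubs = hubness_counts(graph)
print(graph, hubs)
```

The brute-force search costs one distance evaluation per pair of points, which is the quadratic step an approximate index such as HNSW is meant to avoid. A strongly skewed distribution of the hubness counts would indicate that a few points dominate the neighbor lists.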
