BWT construction and search at the terabase scale

Heng Li
Journal: arXiv - QuanBio - Genomics
Publication date: 2024-09-01
DOI: arxiv-2409.00613 (https://doi.org/arxiv-2409.00613)

Abstract

Motivation: The Burrows-Wheeler Transform (BWT) is a common component in full-text indices. Initially developed for data compression, it is particularly powerful for encoding redundant sequences such as pangenome data. However, BWT construction is resource-intensive and hard to parallelize, and many methods for querying large full-text indices only report exact matches or their simple extensions. These limitations have hampered the biological applications of full-text indices.
Results: We developed ropebwt3 for efficient BWT construction and query. Ropebwt3 could index 100 assembled human genomes in 21 hours and index 7.3 terabases of commonly studied bacterial assemblies in 26 days. This was achieved using 82 gigabytes of memory at the peak without working disk space. Ropebwt3 can find maximal exact matches and inexact alignments under affine-gap penalties, and can retrieve all distinct local haplotypes matching a query sequence. It demonstrates the feasibility of full-text indexing at the terabase scale.
Availability and implementation: https://github.com/lh3/ropebwt3
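For readers unfamiliar with how a BWT supports exact-match queries, the following is a minimal textbook sketch in Python. It is not ropebwt3's algorithm: the naive rotation sort and the linear-time rank are illustrative assumptions that would not scale beyond toy inputs, and the function names `bwt` and `count_occurrences` are hypothetical. It only shows the principle that pattern occurrences can be counted from the transformed string alone.

```python
def bwt(text: str) -> str:
    """Naive BWT by sorting all rotations.

    Assumes '$' does not occur in text and sorts before every other symbol
    (true for ASCII letters). Toy construction only, quadratic memory.
    """
    assert "$" not in text
    s = text + "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(bwt_str: str, pattern: str) -> int:
    """Count exact occurrences of pattern via backward search on the BWT."""
    # first_row[c] = number of symbols in the text strictly smaller than c
    counts = {}
    for ch in bwt_str:
        counts[ch] = counts.get(ch, 0) + 1
    first_row, total = {}, 0
    for ch in sorted(counts):
        first_row[ch] = total
        total += counts[ch]

    def rank(ch: str, i: int) -> int:
        # Occurrences of ch in bwt_str[:i]; linear scan for clarity.
        return bwt_str.count(ch, 0, i)

    # [lo, hi) is the range of sorted suffixes prefixed by the current suffix
    # of the pattern; extend it one symbol to the left per iteration.
    lo, hi = 0, len(bwt_str)
    for ch in reversed(pattern):
        if ch not in first_row:
            return 0
        lo = first_row[ch] + rank(ch, lo)
        hi = first_row[ch] + rank(ch, hi)
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    index = bwt("GATTACATTA")
    print(count_occurrences(index, "TTA"))
```

Practical indices replace the linear `rank` with a sampled or run-length-encoded rank structure, which is where the compressibility of redundant sequences such as pangenomes pays off.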