xRead: a coverage-guided approach for scalable construction of read overlapping graph.
Tangchao Kong, Yadong Wang, Bo Liu
GigaScience, volume 14. Published 2025-01-06. DOI: 10.1093/gigascience/giaf007 (https://doi.org/10.1093/gigascience/giaf007)
Full text: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11831799/pdf/
Background: Long-read sequencing holds promise for high-quality, comprehensive de novo assembly of species around the world. However, assemblers still struggle to handle thousands of genomes, assembly sizes of tens of gigabases, and terabase-level datasets efficiently, which is a bottleneck for large-scale de novo sequencing studies. A major cause is read overlapping graph construction, for which state-of-the-art tools usually require terabyte-level RAM and tens of days for large genomes. Such low performance and scalability are ill-suited to the numerous samples being sequenced.
Findings: Herein, we propose xRead, a novel iterative overlapping graph construction approach that achieves high performance, scalability, and yield simultaneously. Guided by its coverage-based model, xRead converts read overlapping into heuristic read-mapping and incremental graph construction tasks, with highly controllable RAM usage and faster speed. It enables the processing of very large datasets (such as the 1.28 Tb Ambystoma mexicanum dataset) with less than 64 GB of RAM and markedly lower time costs. Moreover, benchmarks suggest that it produces highly accurate and well-connected overlapping graphs that support various downstream assembly strategies.
Conclusions: xRead breaks through the major bottleneck in graph construction and lays a new foundation for de novo assembly. The tool is suited to handling large numbers of datasets from large genomes and may play an important role in many de novo sequencing studies.
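The abstract does not spell out the algorithm, so the following is only a toy Python sketch of the general idea it names: replacing all-vs-all overlapping with rounds of mapping reads onto a small indexed batch while the graph grows incrementally. The k-mer matching, the function `build_overlap_graph`, and every parameter (`k`, `min_shared`, `batch_size`, `target_cov`) are assumptions made for illustration, not xRead's actual model or implementation.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return the set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_overlap_graph(reads, k=5, min_shared=3, batch_size=2, target_cov=2):
    """Toy iterative overlap-graph construction (hypothetical, not xRead).

    Each round indexes only `batch_size` reference reads, so the index size
    (a stand-in for RAM use) is bounded by the batch, not by the dataset.
    """
    edges = set()                  # undirected edges (i, j) with i < j
    coverage = [0] * len(reads)    # number of overlaps found so far per read
    used = set()                   # reads that already served as references
    while True:
        # Coverage-guided choice: unused reads that still lack overlaps.
        candidates = [i for i in range(len(reads))
                      if i not in used and coverage[i] < target_cov]
        if not candidates:
            break
        # Least-covered first, then longest, then lowest index.
        candidates.sort(key=lambda i: (coverage[i], -len(reads[i]), i))
        batch = candidates[:batch_size]
        used.update(batch)
        # Index only the current batch of reference reads.
        index = defaultdict(set)
        for ref in batch:
            for km in kmers(reads[ref], k):
                index[km].add(ref)
        # "Map" every read against the batch index and add new edges.
        for q, seq in enumerate(reads):
            hits = defaultdict(int)
            for km in kmers(seq, k):
                for ref in index.get(km, ()):
                    if ref != q:
                        hits[ref] += 1
            for ref, shared in hits.items():
                edge = (min(q, ref), max(q, ref))
                if shared >= min_shared and edge not in edges:
                    edges.add(edge)
                    coverage[q] += 1
                    coverage[ref] += 1
    return edges


if __name__ == "__main__":
    # Made-up sequence cut into overlapping 12-base "reads".
    genome = "ACGTTGCAAGGCTTAACCGGATCGATTACGGCTAGCATCG"
    reads = [genome[i:i + 12] for i in range(0, len(genome) - 11, 5)]
    print(sorted(build_overlap_graph(reads)))
```

In this sketch, a read that has already collected `target_cov` overlaps is never indexed as a reference, which is one simple way a coverage signal could limit redundant work. Setting `target_cov` very high makes every read a reference and reduces the procedure to plain all-vs-all comparison.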
About the journal:
GigaScience seeks to transform data dissemination and utilization in the life and biomedical sciences. As an online open-access open-data journal, it specializes in publishing "big-data" studies encompassing various fields. Its scope includes not only "omic" type data and the fields of high-throughput biology currently serviced by large public repositories, but also the growing range of more difficult-to-access data, such as imaging, neuroscience, ecology, cohort data, systems biology and other new types of large-scale shareable data.