Evaluating data requirements for high-quality haplotype-resolved genomes for creating robust pangenome references

IF 10.1 1区生物学 Q1 BIOTECHNOLOGY & APPLIED MICROBIOLOGY

Genome Biology Pub Date : 2024-12-18 DOI:10.1186/s13059-024-03452-y

Prasad Sarashetti, Josipa Lipovac, Filip Tomas, Mile Šikić, Jianjun Liu

{"title":"Evaluating data requirements for high-quality haplotype-resolved genomes for creating robust pangenome references","authors":"Prasad Sarashetti, Josipa Lipovac, Filip Tomas, Mile Šikić, Jianjun Liu","doi":"10.1186/s13059-024-03452-y","DOIUrl":null,"url":null,"abstract":"Long-read technologies from Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT) have transformed genomics research by providing diverse data types like HiFi, Duplex, and ultra-long ONT. Despite recent strides in achieving haplotype-phased gapless genome assemblies using long-read technologies, concerns persist regarding the representation of genetic diversity, prompting the development of pangenome references. However, pangenome studies face challenges related to data types, volumes, and cost considerations for each assembled genome, while striving to maintain sensitivity. The absence of comprehensive guidance on optimal data selection exacerbates these challenges. Our study evaluates recommended data types and volumes required to establish a robust de novo genome assembly pipeline for population-level pangenome projects, extensively examining performance between ONT’s Duplex and PacBio HiFi datasets in the context of achieving high-quality phased genomes with enhanced contiguity and completeness. The results show that achieving chromosome-level haplotype-resolved assembly requires 20 × high-quality long reads such as PacBio HiFi or ONT Duplex, combined with 15–20 × of ultra-long ONT per haplotype and 10 × of long-range data such as Omni-C or Hi-C. High-quality long reads from both platforms yield assemblies with comparable contiguity, with HiFi excelling in phasing accuracies, while Duplex generates more T2T contigs. Our study provides insights into optimal data types and volumes for robust de novo genome assembly in population-level pangenome projects. Reassessing the recommended data types and volumes in this study and aligning them with practical economic limitations are vital to the pangenome research community, contributing to their efforts and pushing genomic studies with broader impacts.","PeriodicalId":12611,"journal":{"name":"Genome Biology","volume":"1 1","pages":""},"PeriodicalIF":10.1000,"publicationDate":"2024-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Genome Biology","FirstCategoryId":"99","ListUrlMain":"https://doi.org/10.1186/s13059-024-03452-y","RegionNum":1,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"BIOTECHNOLOGY & APPLIED MICROBIOLOGY","Score":null,"Total":0}

引用次数: 0

Abstract

Long-read technologies from Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT) have transformed genomics research by providing diverse data types like HiFi, Duplex, and ultra-long ONT. Despite recent strides in achieving haplotype-phased gapless genome assemblies using long-read technologies, concerns persist regarding the representation of genetic diversity, prompting the development of pangenome references. However, pangenome studies face challenges related to data types, volumes, and cost considerations for each assembled genome, while striving to maintain sensitivity. The absence of comprehensive guidance on optimal data selection exacerbates these challenges. Our study evaluates recommended data types and volumes required to establish a robust de novo genome assembly pipeline for population-level pangenome projects, extensively examining performance between ONT’s Duplex and PacBio HiFi datasets in the context of achieving high-quality phased genomes with enhanced contiguity and completeness. The results show that achieving chromosome-level haplotype-resolved assembly requires 20 × high-quality long reads such as PacBio HiFi or ONT Duplex, combined with 15–20 × of ultra-long ONT per haplotype and 10 × of long-range data such as Omni-C or Hi-C. High-quality long reads from both platforms yield assemblies with comparable contiguity, with HiFi excelling in phasing accuracies, while Duplex generates more T2T contigs. Our study provides insights into optimal data types and volumes for robust de novo genome assembly in population-level pangenome projects. Reassessing the recommended data types and volumes in this study and aligning them with practical economic limitations are vital to the pangenome research community, contributing to their efforts and pushing genomic studies with broader impacts.

查看原文本刊更多论文

评估高质量单倍型解析基因组的数据需求，以创建稳健的泛基因组参考

来自太平洋生物科学公司（PacBio）和牛津纳米孔技术公司（ONT）的长读技术通过提供多种数据类型，如HiFi、Duplex和超长ONT，改变了基因组学研究。尽管最近在利用长读技术实现单倍型阶段无间隙基因组组装方面取得了进展，但对遗传多样性的表现的关注仍然存在，这促使了泛基因组参考文献的发展。然而，泛基因组研究在努力保持敏感性的同时，面临着与每个组装基因组的数据类型、数量和成本考虑相关的挑战。缺乏对最佳数据选择的全面指导加剧了这些挑战。我们的研究评估了为种群级泛基因组项目建立强大的从头基因组组装管道所需的推荐数据类型和数量，广泛检查了ONT的Duplex和PacBio HiFi数据集之间的性能，以实现具有增强的连续性和完整性的高质量阶段基因组。结果表明，实现染色体水平的单倍型解析组装需要20倍高质量的长读取，如PacBio HiFi或ONT Duplex，结合每个单倍型15-20倍的超长ONT和10倍的远程数据，如Omni-C或Hi-C。来自两个平台的高质量长读取产生具有相当连续性的组件，HiFi在相位精度方面表现出色，而Duplex产生更多的T2T组件。我们的研究为种群水平泛基因组项目中稳健的从头基因组组装提供了最佳数据类型和数量的见解。重新评估本研究中推荐的数据类型和数量，并使其与实际经济限制保持一致，对泛基因组研究界至关重要，有助于他们的努力，并推动基因组研究产生更广泛的影响。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Genome Biology Biochemistry, Genetics and Molecular Biology-Genetics

CiteScore

21.00

自引率

3.30%

发文量

241

审稿时长

2 months

期刊介绍： Genome Biology stands as a premier platform for exceptional research across all domains of biology and biomedicine, explored through a genomic and post-genomic lens. With an impressive impact factor of 12.3 (2022),* the journal secures its position as the 3rd-ranked research journal in the Genetics and Heredity category and the 2nd-ranked research journal in the Biotechnology and Applied Microbiology category by Thomson Reuters. Notably, Genome Biology holds the distinction of being the highest-ranked open-access journal in this category. Our dedicated team of highly trained in-house Editors collaborates closely with our esteemed Editorial Board of international experts, ensuring the journal remains on the forefront of scientific advances and community standards. Regular engagement with researchers at conferences and institute visits underscores our commitment to staying abreast of the latest developments in the field.