Pangenome-based genome inference using integer programming

IF 5.5 | Zone 2 (Biology) | Q1 (Biochemistry & Molecular Biology)

Genome Research | Pub Date: 2025-08-21 | DOI: 10.1101/gr.280567.125

Ghanshyam Chandra, Md Helal Hossen, Stephan Scholz, Alexander T Dilthey, Daniel Gibney, Chirag Jain

Abstract

Affordable genotyping methods are essential in genomics. Commonly used genotyping methods primarily support single nucleotide variants and short indels but neglect structural variants. Additionally, the accuracy of read alignments to a reference genome is unreliable in highly polymorphic and repetitive regions, further impacting genotyping performance. Recent work highlights the advantage of haplotype-resolved pangenome graphs in addressing these challenges. Building on these developments, we propose a rigorous alignment-free genotyping method. Our optimization framework identifies a path through the pangenome graph that maximizes the matches between the path and substrings of sequencing reads (e.g., k-mers) while minimizing recombination events (haplotype switches) along the path. We prove that this problem is NP-hard and develop efficient integer-programming solutions. We benchmarked the algorithm using downsampled short-read datasets from homozygous human cell lines with coverage ranging from 0.1× to 10×. Our algorithm accurately estimates complete major histocompatibility complex (MHC) haplotype sequences with small edit distances from the ground-truth sequences, providing a significant advantage over existing methods on low-coverage inputs. While this algorithm is designed for haploid genomes, we discuss directions for extending it to diploid genotyping.
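
The trade-off described in the abstract (reward k-mer matches, penalize haplotype switches) can be illustrated with a toy sketch. This is not the authors' integer program. It assumes a drastically simplified pangenome in which every haplotype is cut into the same number of aligned blocks, and it finds the best path by brute-force enumeration, which is only feasible for tiny inputs. The names `kmers`, `infer_path`, `switch_penalty`, and the toy sequences are hypothetical and chosen for illustration only.

```python
from itertools import product


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def infer_path(haplotypes, read_kmers, k, switch_penalty):
    """Exhaustively pick one haplotype per block (toy model, assumption).

    haplotypes: list of haplotypes, each a list of per-block sequences;
                all haplotypes must have the same number of blocks.
    Score = (# distinct read k-mers present in the spelled path)
            - switch_penalty * (# haplotype switches along the path).
    Returns (best_score, best_choice, best_sequence).
    """
    n_blocks = len(haplotypes[0])
    best = None
    for choice in product(range(len(haplotypes)), repeat=n_blocks):
        # Spell the sequence of the path defined by this choice.
        seq = "".join(haplotypes[h][b] for b, h in enumerate(choice))
        matched = len(kmers(seq, k) & read_kmers)
        switches = sum(1 for a, b in zip(choice, choice[1:]) if a != b)
        score = matched - switch_penalty * switches
        if best is None or score > best[0]:
            best = (score, choice, seq)
    return best


# Toy data (hypothetical): two haplotypes, three blocks each.
haps = [
    ["AAAA", "CCCC", "GGGG"],  # haplotype 0
    ["TTTT", "ACAC", "GTGT"],  # haplotype 1
]
# The "sample" is a recombinant: block 0 from haplotype 0, then haplotype 1.
sample = "AAAA" + "ACAC" + "GTGT"
read_kmers = kmers(sample, 3)

low_penalty = infer_path(haps, read_kmers, k=3, switch_penalty=1)
high_penalty = infer_path(haps, read_kmers, k=3, switch_penalty=3)
```

With a switch penalty of 1 the search recovers the recombinant path `(0, 1, 1)`, whereas a penalty of 3 makes the switch too costly and the search stays on haplotype 1 throughout. Exhaustive enumeration grows exponentially with the number of blocks, which is consistent with the abstract's motivation for using integer programming on real pangenome graphs.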

Source journal

Genome Research (Biology: Biochemistry & Molecular Biology)

CiteScore: 12.40
Self-citation rate: 1.40%
Publication volume: 140
Review time: 6 months

About the journal: Launched in 1995, Genome Research is an international, continuously published, peer-reviewed journal that focuses on research that provides novel insights into the genome biology of all organisms, including advances in genomic medicine. Among the topics considered by the journal are genome structure and function, comparative genomics, molecular evolution, genome-scale quantitative and population genetics, proteomics, epigenomics, and systems biology. The journal also features exciting gene discoveries and reports of cutting-edge computational biology and high-throughput methodologies. New data in these areas are published as research papers, or methods and resource reports that provide novel information on technologies or tools that will be of interest to a broad readership. Complete data sets are presented electronically on the journal's web site where appropriate. The journal also provides Reviews, Perspectives, and Insight/Outlook articles, which present commentary on the latest advances published both here and elsewhere, placing such progress in its broader biological context.