NAR Genomics and Bioinformatics: Latest Articles (Page 2)

Improving accuracy in genome-wide association studies: a two-step approach for handling below limit of detection biomarker measurements.

IF 2.8

NAR Genomics and Bioinformatics Pub Date : 2025-12-31 eCollection Date: 2025-12-01 DOI: 10.1093/nargab/lqaf201

Yaqi A Deng, Torgny Karlsson, Åsa Johansson

Abstract: Advances in high-throughput technologies enable large-scale studies on genomics and molecular phenotypes. However, the trade-off between quality and quantity reduces assay sensitivity, and several measurements in large-scale proteomics and metabolomics analytes fall below the limit of detection (LOD). If not properly addressed, this may introduce bias in effect estimates. To address this, we conducted a simulation study to evaluate the performance of linear, Tobit, Cox, and logistic modeling in the presence of below-LOD measurements in genome-wide association studies. We identified the optimal strategy as a two-step Linear-Tobit scheme, including rapid screening with linear regression followed by refinement with Tobit regression to retrieve accurate effect estimates. This higher accuracy helps mitigate a 1.3-fold and 2.7-fold inflation in causal estimates in a Mendelian randomization (MR) study, which would otherwise be present with 50% and 90% values below LOD. Validation through case studies on estradiol and testosterone levels in the UK Biobank confirmed the simulation results across subgroups with varying proportions of below-LOD measurements. The Linear-Tobit scheme offers optimal detection power and efficiency, with a focus on its applicability to biobank-scale datasets and accuracy in effect estimates to mitigate bias in downstream applications such as MR and polygenic risk scores.

Volume 7, Issue 4, article lqaf201. Full-text PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12754788/pdf/

Citations: 0
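
The two-step Linear-Tobit scheme in the abstract above pairs a fast linear screen with a censored-regression refinement. The following is a minimal sketch of the two ingredients, not the authors' implementation: it assumes a single predictor (for example a SNP dosage), left-censoring at a known LOD, and below-LOD values recorded at or below that LOD. The function names `ols_slope` and `tobit_loglik` are hypothetical.

```python
import math


def ols_slope(x, y):
    """Step 1 (screening): ordinary least-squares slope of y on x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


def norm_logpdf(z):
    """Log density of the standard normal distribution."""
    return -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)


def norm_logcdf(z):
    """Log CDF of the standard normal, floored to avoid log(0)."""
    p = 0.5 * math.erfc(-z / math.sqrt(2.0))
    return math.log(max(p, 1e-300))


def tobit_loglik(intercept, slope, sigma, x, y, lod):
    """Step 2 (refinement): log-likelihood of a left-censored linear model.

    Observations with y <= lod contribute P(latent value <= lod);
    all others contribute the usual Gaussian density.
    """
    total = 0.0
    for xi, yi in zip(x, y):
        mu = intercept + slope * xi
        if yi <= lod:
            total += norm_logcdf((lod - mu) / sigma)
        else:
            total += norm_logpdf((yi - mu) / sigma) - math.log(sigma)
    return total
```

In practice the Tobit parameters would be found by maximising `tobit_loglik` with a numerical optimiser, and only for variants that pass the linear screen; the choice of optimiser and screening threshold is left open here.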

A supervised Bayesian method for time (re)annotation of transcriptomics data.

IF 2.8

NAR Genomics and Bioinformatics Pub Date : 2025-12-31 eCollection Date: 2025-12-01 DOI: 10.1093/nargab/lqaf203

Elio Nushi, François P Douillard, Katja Selby, Benjamin A Blount, Oliver J Pennington, Nigel P Minton, Miia Lindström, Antti Honkela

Abstract: Transcriptomics experiments are often conducted to capture changes in gene expression over time. However, time annotations may be missing, imprecise, or may not reflect the same physiological state of the bacterial culture between different experiments. Assigning accurate time points to these experiments using a reference time course is therefore crucial for identifying differentially expressed genes, and understanding gene regulatory networks for elucidating the studied organism's physiology and life cycle. This important task, which could enhance the biological interpretation of the transcriptomics experiments, has not been previously addressed. In this work, we propose a novel method to solve the challenge of realigning transcriptomics experiments based on a reference time course. Our method is based on a Bayesian approach that uses Gaussian process regression modeling. We show a use case of applying our method for assigning time annotations in legacy microarray samples of the bacterium Clostridium botulinum, which were solely annotated based on the growth phase at the time when the culture aliquots were sampled, utilizing recently collected RNA-Seq time series data comprising multiple replicates as a reference. The method significantly improved the description of the growth phases of the microarray data compared to the original annotations by clearly delineating the microarray samples belonging to different growth phases, as demonstrated by principal component analysis. Consequently, a larger number of differentially expressed genes was detected when comparing experiments belonging to successive growth phases. We compare this innovative approach with a baseline method that uses the k-nearest neighbor algorithm and show that our method offers a higher resolution in the description of the data by exposing smaller time changes between samples. We also test the performance of the method on sparse RNA-Seq time series (i.e. sampled every second hour). All the predictions for the samples were within a 30-min margin of their true time.

Volume 7, Issue 4, article lqaf203. Full-text PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12754789/pdf/

Citations: 0
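
To make the idea of assigning a time to an unannotated sample concrete, here is a deliberately simplified sketch. It is not the authors' Gaussian process model: it assumes the reference mean expression of each gene is already available on a grid of time points, uses a fixed Gaussian noise level and a flat prior, and computes a posterior over the grid. A k-nearest-neighbour baseline of the kind mentioned in the abstract is included for comparison. All names and the toy numbers are hypothetical.

```python
import math


def time_posterior(query, ref_means, noise_sd=1.0):
    """Posterior probability of each grid time point for one query sample.

    query: expression values, one per gene.
    ref_means[t][g]: assumed reference mean of gene g at grid point t.
    """
    log_post = []
    for means_t in ref_means:
        sq_err = sum((q - m) ** 2 for q, m in zip(query, means_t))
        log_post.append(-0.5 * sq_err / noise_sd ** 2)
    top = max(log_post)  # subtract the maximum for numerical stability
    weights = [math.exp(lp - top) for lp in log_post]
    total = sum(weights)
    return [w / total for w in weights]


def knn_time(query, ref_samples, ref_times, k=3):
    """Baseline: mean time of the k reference samples closest to the query."""
    order = sorted(range(len(ref_samples)),
                   key=lambda i: math.dist(query, ref_samples[i]))
    nearest = order[:k]
    return sum(ref_times[i] for i in nearest) / len(nearest)


# Toy reference: two genes measured at 0 h, 2 h and 4 h.
grid_times = [0.0, 2.0, 4.0]
reference = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0]]
posterior = time_posterior([1.0, 2.0], reference)
expected_time = sum(t * p for t, p in zip(grid_times, posterior))
```

The posterior gives a full distribution over candidate times, which is what allows finer resolution than a nearest-neighbour vote; a denser grid would be needed to resolve differences smaller than the reference sampling interval.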

DEDUCE: statistical inference on disease-associated genes uncovers tissue-disease associations.

IF 2.8

NAR Genomics and Bioinformatics Pub Date : 2025-12-31 eCollection Date: 2025-12-01 DOI: 10.1093/nargab/lqaf205

Boqi Wang, Jiayi Wang, Ammar Aleem Rashied, Bo Meng, Jesse Zhang, Jun S Liu, Jie Jiang, Zhaohui S Qin

Abstract: Accurate identification of affected tissues of human diseases is important for the derivation of disease etiology and the development of new treatment strategies. In this study, we develop a logistic regression-based method named DEDUCE (disease tissue detection using logistic regression) that combines genomics big data and machine learning to address this important problem. The central hypothesis is that most disease-associated genes are expressed specifically in affected tissues. DEDUCE takes advantage of newly emerged data on disease-related genes as well as tissue-specific gene expression data. The unique feature of DEDUCE is that it takes into account the strength of gene-disease associations. When we applied DEDUCE to a total of 3,261,324 gene-disease associations collected from DisGeNET covering 30,170 diseases and 21,666 genes, we identified 216 significant tissue-disease pairs composed of 120 unique diseases and 37 unique tissues. Many of them shed light on potential explanations for disease pathogenesis. The results showed great consistency with previous findings and were proven effective by empirical plots and gene set enrichment analysis. Overall, DEDUCE has shown great potential in uncovering novel pathogenesis mechanisms of complex diseases. In-depth analysis and experimental validation are required to fully understand these discovered tissue-trait associations and their enriched genes.

Volume 7, Issue 4, article lqaf205. Full-text PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12754781/pdf/

Citations: 0
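
The abstract above does not spell out the DEDUCE model, so the following is only an illustrative sketch of one way a logistic regression could use association strength: each gene's disease label is regressed on a tissue-specificity score, and each gene is weighted by a hypothetical gene-disease association score. The weighting scheme, variable names, and toy data are all assumptions made for illustration.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_weighted_logistic(x, y, w, lr=0.1, steps=2000):
    """Fit P(y=1 | x) = sigmoid(b0 + b1 * x) by weighted gradient ascent.

    x: tissue-specificity score per gene (assumed input).
    y: 1 if the gene is disease-associated, else 0.
    w: per-gene weight, e.g. association strength (assumed input).
    """
    b0, b1 = 0.0, 0.0
    total_w = sum(w)
    for _ in range(steps):
        g0 = g1 = 0.0
        for xi, yi, wi in zip(x, y, w):
            err = wi * (yi - sigmoid(b0 + b1 * xi))
            g0 += err
            g1 += err * xi
        b0 += lr * g0 / total_w
        b1 += lr * g1 / total_w
    return b0, b1


# Toy data: disease genes tend to have higher specificity in this tissue.
specificity = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9, 0.6, 0.4]
is_disease_gene = [0, 0, 0, 1, 1, 1, 0, 1]
strength = [1.0, 1.0, 1.0, 0.9, 0.8, 1.0, 1.0, 0.3]
intercept, slope = fit_weighted_logistic(specificity, is_disease_gene, strength)
```

A positive fitted slope would point towards a tissue-disease association under this toy model; a real analysis would also need a significance test and multiple-testing correction across tissue-disease pairs.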

IRESeek: structure-informed deep learning method for accurate identification of internal ribosome entry sites in circular RNAs.

IF 2.8

NAR Genomics and Bioinformatics Pub Date : 2025-12-31 eCollection Date: 2025-12-01 DOI: 10.1093/nargab/lqaf210

Feng Zhang, Heqin Zhu, Jiayin Gao, Jie Hu, Ke Chen, Shaohua Kevin Zhou, Peng Xiong

Abstract: The internal ribosome entry site (IRES) is a special type of RNA cis-acting element that can initiate translation independently of the 5' cap structure and is widely found in viral RNAs and eukaryotic messenger RNAs. In recent years, an increasing number of studies have revealed that IRES elements also exist in circular RNAs (circRNAs) and mediate their translation. CircRNAs exhibit high stability and tissue specificity, playing critical roles in various physiological and pathological processes. Their coding potential provides important clues for the discovery of novel functional proteins. However, due to the nonlinear structure of circRNAs and the complexity of IRES-mediated regulatory mechanisms, accurately identifying IRES elements within circRNAs remains a significant challenge. Here, we propose IRESeek, a dual-branch deep learning framework for highly accurate detection of IRES elements in circRNA, which uses a transformer for RNA sequence modeling and a graph convolutional network for RNA structural guidance. To capture the structural patterns of circRNAs, IRESeek uses the physics-based thermodynamic energy of RNA secondary structure (the base pair motif energy) and the base pair probability as structural guidance features that are combined with the RNA sequence, enabling comprehensive joint learning of RNA sequence and base pair interactions.

Volume 7, Issue 4, article lqaf210. Full-text PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12754787/pdf/

Citations: 0
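
A graph convolutional branch of the kind described above needs the RNA represented as a graph. The sketch below builds a plain binary adjacency matrix for a circular RNA from a dot-bracket secondary structure. This is a simplification and an assumption: the abstract states that IRESeek uses base pair motif energies and base pair probabilities as structural features, whereas this example uses unweighted edges only. The function name is hypothetical.

```python
def circrna_adjacency(structure):
    """Adjacency matrix of a circular RNA graph from dot-bracket notation.

    Nodes are nucleotides. Edges are (a) backbone bonds, including the
    back-splice junction joining the last nucleotide to the first, and
    (b) base pairs given by matching parentheses.
    """
    n = len(structure)
    adj = [[0] * n for _ in range(n)]

    # Backbone edges; the modulo closes the circle.
    for i in range(n):
        j = (i + 1) % n
        adj[i][j] = adj[j][i] = 1

    # Base-pair edges from matching brackets.
    stack = []
    for i, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(i)
        elif symbol == ")":
            if not stack:
                raise ValueError("unbalanced dot-bracket structure")
            j = stack.pop()
            adj[i][j] = adj[j][i] = 1
    if stack:
        raise ValueError("unbalanced dot-bracket structure")
    return adj


adjacency = circrna_adjacency("(..).")
```

Replacing the 1s on base-pair edges with pairing probabilities or energies would give a weighted graph closer in spirit to the features named in the abstract.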

G-quadruplex structures as modulators of alternative promoter usage.

IF 2.8

NAR Genomics and Bioinformatics Pub Date : 2025-12-31 eCollection Date: 2025-12-01 DOI: 10.1093/nargab/lqaf208

Rongxin Zhang, Jean-Louis Mergny

Abstract: The precise regulation of gene transcription relies on promoters, and the selection of specific promoters for a particular gene is a key determinant of transcript diversity. However, the regulatory mechanisms governing promoter selection are not fully understood. G-quadruplexes (G4s) are unique DNA noncanonical secondary structures that have emerged as important regulators of gene expression. In this study, we systematically analyzed the relationship between G4 structures and alternative promoters (APs) in two cancer cell lines, K562 and HepG2, by integrating native elongating transcript-cap analysis of gene expression and G4 ChIP-seq datasets. We identified 573 differentially utilized APs (|fold change| > 2, false discovery rate < 0.05), 26% of which were associated with G4 structures within 100 base pairs. Notably, G4-associated promoters predominantly exhibited increased activity, suggesting that G4s generally promote AP selection. Furthermore, treatment with G4 ligands induced the generation of APs, suggesting that the stabilization of G4 structures may modulate AP usage. Collectively, these findings provide new insights into the G4-based mechanisms that regulate transcript isoform diversity.

Volume 7, Issue 4, article lqaf208. Full-text PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12754776/pdf/

Citations: 0
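
The filtering logic in the abstract above (a fold-change cutoff, an FDR cutoff, and a G4 within 100 base pairs) can be sketched as follows. Several details are assumptions: the fold-change cutoff is expressed as |log2 fold change| > 1, promoters are represented by a single TSS coordinate, G4 peaks are closed intervals, and the record fields and function names are hypothetical.

```python
def interval_distance(pos, start, end):
    """Distance in bp from a position to the closed interval [start, end]."""
    if pos < start:
        return start - pos
    if pos > end:
        return pos - end
    return 0


def g4_associated(promoters, g4_peaks, max_dist=100,
                  min_abs_log2fc=1.0, max_fdr=0.05):
    """Return (name, has_nearby_g4) for each differentially used promoter.

    promoters: dicts with keys name, chrom, tss, log2fc, fdr.
    g4_peaks: tuples (chrom, start, end).
    """
    results = []
    for p in promoters:
        # Keep only promoters passing both significance thresholds.
        if abs(p["log2fc"]) <= min_abs_log2fc or p["fdr"] >= max_fdr:
            continue
        near = any(
            chrom == p["chrom"]
            and interval_distance(p["tss"], start, end) <= max_dist
            for chrom, start, end in g4_peaks
        )
        results.append((p["name"], near))
    return results


toy_promoters = [
    {"name": "P1", "chrom": "chr1", "tss": 1000, "log2fc": 1.5, "fdr": 0.01},
    {"name": "P2", "chrom": "chr1", "tss": 5000, "log2fc": -2.0, "fdr": 0.001},
    {"name": "P3", "chrom": "chr1", "tss": 1050, "log2fc": 0.5, "fdr": 0.01},
]
toy_peaks = [("chr1", 1080, 1200)]
toy_result = g4_associated(toy_promoters, toy_peaks)
```

The linear scan over peaks is adequate for a toy example; genome-scale data would normally use an interval index or a sorted sweep instead.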

Designing genetically stable multicopy gene constructs with the ChimeraUGEM web server.

IF 2.8

NAR Genomics and Bioinformatics Pub Date : 2025-12-29 eCollection Date: 2025-12-01 DOI: 10.1093/nargab/lqaf191

Moritz Burghardt, Alon Diament, Tamir Tuller

Abstract: High expression of heterologous proteins is often achieved by integrating multiple copies of a gene into a host. However, such multicopy systems are prone to genetic instability due to homologous recombination between identical sequences. We present the multisequence ChimeraMap (MScMap), an algorithm for designing multiple synonymous coding sequences that minimizes recombination risk while maintaining high expression. MScMap extends the ChimeraMap framework by selecting diverse nucleotide blocks from a host genome to encode the target protein, balancing host adaptation and sequence dissimilarity. We introduce heuristics for block selection and concatenation to reduce long common substrings, a known driver of recombination. Our method outperforms a multi-objective evolutionary algorithm in both genetic stability and predicted expression across a wide range of human proteins while being significantly faster. We also show that MScMap can be used to reduce sequence repeats within a single coding sequence. A web tool for single and multicopy coding sequence optimization is available online.

Volume 7, Issue 4, article lqaf191. Full-text PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12746100/pdf/

Citations: 0
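
Since the abstract above identifies long common substrings as a driver of recombination, a natural quantity to inspect is the longest substring shared by two synonymous coding sequences. The sketch below computes it with a standard dynamic-programming routine. It is a generic illustration, not the MScMap heuristic, and the two toy sequences (both encoding Met-Ala-Lys-Leu) are made up.

```python
def longest_common_substring(a, b):
    """Length of the longest contiguous substring shared by a and b.

    Classic O(len(a) * len(b)) dynamic programme that keeps only the
    previous row: cur[j] is the length of the common suffix of
    a[:i] and b[:j].
    """
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


copy_1 = "ATGGCTAAACTG"  # ATG GCT AAA CTG
copy_2 = "ATGGCCAAGCTT"  # ATG GCC AAG CTT (same amino acids)
shared = longest_common_substring(copy_1, copy_2)
```

For whole-gene or genome-scale comparisons a suffix automaton or suffix array would scale better than this quadratic routine.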

Integrating natural language processing and genome analysis enables accurate bacterial phenotype prediction.

IF 2.8

NAR Genomics and Bioinformatics Pub Date : 2025-12-29 eCollection Date: 2025-12-01 DOI: 10.1093/nargab/lqaf174

Daniel Gómez-Pérez, Alexander Keller

Abstract: Understanding microbial phenotypes from genomic data is crucial for studying co-evolution, ecology, and pathology. This study presents a scalable approach that integrates literature-extracted information with genomic data, combining natural language processing and functional genome analysis. We applied this method to publicly available data, providing novel insights into predicting microbial phenotypes. We fine-tuned transformer-based language models to analyze 3.83 million open-access scientific articles, extracting a phenotypic network of bacterial strains. This network maps relationships between strains and traits such as pathogenicity, metabolism, and biome preference. By annotating their reference genomes, we predicted key genes influencing these traits. Our findings align with known phenotypes, reveal novel correlations, and uncover genes involved in disease and host associations. The network's interconnectivity provides deeper understanding of microbial communities and allowed identification of hub species through inferred trophic connections that are difficult to infer experimentally. This work demonstrates the potential of machine learning for uncovering cross-species gene-phenotype patterns. As microbial genomic data and literature expand, such methods will be essential for extracting meaningful insights and advancing microbiology research. In summary, this integrative approach can accelerate discovery and understanding in microbial genomics. Ultimately, such techniques will facilitate the study of microbial ecology, co-evolutionary processes, and disease pathogenesis to an unprecedented depth.

Volume 7, Issue 4, article lqaf174. Full-text PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12746109/pdf/

Citations: 0

A computational framework to dissect imputation strategies for single-cell histone modification data.

IF 2.8

NAR Genomics and Bioinformatics Pub Date : 2025-12-29 eCollection Date: 2025-12-01 DOI: 10.1093/nargab/lqaf192

Marta Moreno-González, Jeroen de Ridder, Jop Kind, Robin H van der Weide

Abstract: Single-cell profiling of histone post-translational modifications (scHPTMs) offers a powerful lens for dissecting epigenetic regulation and cellular identity, yet low read depth and inherent noise in these datasets pose significant analytical challenges. Here, we introduce the first comprehensive computational framework that systematically evaluates imputation strategies on scHPTM data, including methods originally developed for scRNA-seq and scATAC-seq. Leveraging both synthetic and published datasets, we apply novel performance metrics, implemented in a modular R package, to assess signal recovery, enrichment at biologically relevant genomic sites, and preservation of cell-to-cell similarities. Our extensive benchmarking reveals that performance varies markedly by analytical task (e.g. signal denoising, peak detection, and clustering), highlighting that no one-size-fits-all solution exists for these data. By delineating the strengths and limitations of current imputation approaches, this work lays the foundation for the targeted development of next-generation, task-aware algorithms, while providing critical guidance for researchers and developers on the current capabilities and unmet needs in single-cell epigenomics.

Volume 7, Issue 4, article lqaf192. Full-text PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12746105/pdf/

Citations: 0

Life at the extremes: maximally divergent microbes with similar genomic signatures linked to extreme environments.

IF 2.8

NAR Genomics and Bioinformatics Pub Date : 2025-12-23 eCollection Date: 2025-12-01 DOI: 10.1093/nargab/lqaf189

Monireh Safari, Joseph Butler, Gurjit S Randhawa, Kathleen A Hill, Lila Kari

Abstract: Extreme environments impose strong mutation and selection pressures that drive distinctive, yet understudied, genomic adaptations in extremophiles. In this study, we identify 15 bacterium-archaeon pairs that exhibit highly similar k-mer-based genomic signatures despite maximal taxonomic divergence, suggesting that shared environmental conditions can produce convergent, genome-wide sequence patterns that transcend evolutionary distance. To uncover these patterns, we developed a computational pipeline to select a composite genome proxy assembled from noncontiguous subsequences of the genome. Using supervised machine learning on a curated dataset of 693 extremophile microbial genomes, we found that 6-mers and 100 kbp genome proxy lengths provide the best balance between classification accuracy and computational efficiency. Our results provide conclusive evidence of the pervasive nature of k-mer-based patterns across the genome, and uncover the presence of taxonomic and environmental components that persist across all regions of the genome. The 15 bacterium-archaeon pairs identified by our method as having similar genomic signatures were validated through multiple independent analyses, including 3-mer frequency profile comparisons, phenotypic trait similarity, and geographic co-occurrence data. These complementary validations confirmed that extreme environmental pressures can override traditionally recognized taxonomic components at the whole-genome level. Together, these findings reveal that adaptation to extreme conditions can carry robust, taxonomic domain-spanning imprints on microbial genomes, offering new insight into the relationship between environmental impacts and genome sequence composition convergence.

Volume 7, Issue 4, article lqaf189. Full-text PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12723239/pdf/

Citations: 0
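
A k-mer-based genomic signature of the kind discussed above is, at its simplest, a normalised vector of k-mer frequencies. The sketch below computes such a profile and compares two profiles by cosine similarity. The choice of cosine similarity and the function names are assumptions made for illustration; the abstract does not state which distance the authors use.

```python
import math
from collections import Counter
from itertools import product


def kmer_profile(seq, k):
    """Relative frequency of every A/C/G/T k-mer, in lexicographic order.

    Windows containing other symbols (e.g. N) are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


profile_a = kmer_profile("ACGTACGTTTGACA", 3)
profile_b = kmer_profile("ACGTACGATTGACA", 3)
similarity = cosine_similarity(profile_a, profile_b)
```

With k = 6 the profile has 4^6 = 4096 entries, which is why a genome proxy of roughly 100 kbp, as reported in the abstract, is enough to populate it reasonably well.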

Pastrami: a fast and efficient algorithm for fine-scale genetic ancestry inference.

IF 2.8

NAR Genomics and Bioinformatics Pub Date : 2025-12-23 eCollection Date: 2025-12-01 DOI: 10.1093/nargab/lqaf184

Andrew B Conley, Lavanya Rishishwar, Shivam Sharma, Emily T Norris, I King Jordan, Leonardo Mariño-Ramírez

Abstract: Genomics research increasingly relies on large population biobanks that include many thousands of participants. However, current genetic ancestry inference methods are computationally inefficient and prohibitively slow when applied to such large cohorts. The aim of this work was to develop a fast and efficient algorithm for fine-scale genetic ancestry inference on biobank-size cohorts. The Pastrami algorithm that we developed performs supervised genetic ancestry inference by comparing haplotypes between query and global reference samples, creating query and reference haplotype copying vectors, and relating them via non-negative least squares regression to estimate ancestry fractions. We used Pastrami for ancestry inference on genomic data sets from Africa, the Americas, and the United Kingdom, comparing its accuracy and runtime performance to the most widely used haplotype-based ancestry inference methods. Pastrami ancestry estimates are highly similar to estimates from the ChromoPainter and RFMix programs. The total CPU time required by Pastrami increases linearly with the number of samples, and it achieves ∼45× faster runtime than ChromoPainter. When run on 488 377 UK Biobank and 3433 reference samples, Pastrami used 2340 CPU hours compared to ∼105 000 CPU hours for ChromoPainter. The Pastrami program and documentation are made freely available on GitHub: https://github.com/healthdisparities/pastrami.

Volume 7, Issue 4, article lqaf184. Full-text PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12723237/pdf/

Citations: 0
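
The final step described in the abstract above, relating a query copying vector to reference copying vectors by non-negative least squares, can be illustrated with a small solver. This sketch uses projected gradient descent so that it needs only the standard library; that solver choice is a substitution made for illustration and is not claimed to be what Pastrami uses. The function names and the toy copying vectors are hypothetical.

```python
def nnls_projected_gradient(columns, target, iterations=2000):
    """Minimise ||sum_j a_j * columns[j] - target||^2 subject to a_j >= 0.

    columns: one copying vector per reference population.
    target: the query copying vector (same length as each column).
    """
    p = len(columns)
    gram = [[sum(x * y for x, y in zip(ci, cj)) for cj in columns]
            for ci in columns]
    rhs = [sum(x * y for x, y in zip(ci, target)) for ci in columns]
    trace = sum(gram[j][j] for j in range(p))
    if trace == 0:
        return [0.0] * p
    step = 1.0 / trace  # trace bounds the largest eigenvalue, so this is stable
    a = [0.0] * p
    for _ in range(iterations):
        grad = [sum(gram[j][k] * a[k] for k in range(p)) - rhs[j]
                for j in range(p)]
        # Gradient step followed by projection onto the non-negative orthant.
        a = [max(0.0, a[j] - step * grad[j]) for j in range(p)]
    return a


def ancestry_fractions(columns, target):
    """Non-negative coefficients rescaled to sum to one."""
    coeffs = nnls_projected_gradient(columns, target)
    total = sum(coeffs)
    return [c / total for c in coeffs] if total > 0 else coeffs


# Toy example: the query is a 60/40 mixture of two reference vectors.
reference_vectors = [[1.0, 1.0, 0.0], [0.0, 1.0, 1.0]]
query_vector = [0.6, 1.0, 0.4]
fractions = ancestry_fractions(reference_vectors, query_vector)
```

Because each query sample is solved independently against the same reference vectors, the total cost grows linearly with the number of query samples, consistent with the linear scaling reported in the abstract.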