Fast Context-Aware Analysis of Genome Annotation Colocalization.

Impact factor: 1.4 · Zone 4 (Biology) · JCR Q4 (Biochemical Research Methods)

Journal of Computational Biology Pub Date : 2024-10-01 Epub Date: 2024-10-09 DOI:10.1089/cmb.2024.0667

Askar Gafurov, Tomáš Vinař, Paul Medvedev, Broňa Brejová

Pages: 946-964. Publication type: Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11698669/pdf/

Citations: 0

Abstract

An annotation is a set of genomic intervals sharing a particular function or property. Examples include genes or their exons, sequence repeats, regions with a particular epigenetic state, and copy number variants. A common task is to compare two annotations to determine if one is enriched or depleted in the regions covered by the other. We study the problem of assigning statistical significance to such a comparison based on a null model representing random unrelated annotations. To incorporate more background information into such analyses, we propose a new null model based on a Markov chain that differentiates among several genomic contexts. These contexts can capture various confounding factors, such as GC content or assembly gaps. We then develop a new algorithm for estimating p-values by computing the exact expectation and variance of the test statistic and then estimating the p-value using a normal approximation. Compared to the previous algorithm by Gafurov et al., the new algorithm provides three advances: (1) the running time is improved from quadratic to linear or quasi-linear, (2) the algorithm can handle two different test statistics, and (3) the algorithm can handle both simple and context-dependent Markov chain null models. We demonstrate the efficiency and accuracy of our algorithm on synthetic and real data sets, including the recent human telomere-to-telomere assembly. In particular, our algorithm computed p-values for 450 pairs of human genome annotations using 24 threads in under three hours. Moreover, the use of genomic contexts to correct for GC bias resulted in the reversal of some previously published findings.
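The final step described in the abstract, turning an observed test statistic plus its expectation and variance into a p-value through a normal approximation, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `shared_bases` and `normal_p_values` are hypothetical, the choice of shared bases as the test statistic is an assumption, and the mean and variance are taken as given inputs rather than derived from the Markov chain null model, which is the part the paper's algorithm actually solves.

```python
import math


def shared_bases(query, reference):
    """Count bases covered by both annotations.

    Each annotation is a sorted list of non-overlapping, half-open
    intervals (start, end) on a single chromosome. This statistic is
    an illustrative choice, not necessarily the paper's definition.
    """
    i = j = total = 0
    while i < len(query) and j < len(reference):
        lo = max(query[i][0], reference[j][0])
        hi = min(query[i][1], reference[j][1])
        if lo < hi:
            total += hi - lo
        # Advance whichever interval ends first.
        if query[i][1] <= reference[j][1]:
            i += 1
        else:
            j += 1
    return total


def normal_p_values(observed, mean, variance):
    """Return (z, p_enrichment, p_depletion) under a normal approximation.

    `mean` and `variance` are assumed to be the expectation and variance
    of the statistic under the null model, supplied by the caller.
    """
    if variance <= 0:
        raise ValueError("variance must be positive")
    z = (observed - mean) / math.sqrt(variance)
    p_enrichment = 0.5 * math.erfc(z / math.sqrt(2))   # P(Z >= z)
    p_depletion = 0.5 * math.erfc(-z / math.sqrt(2))   # P(Z <= z)
    return z, p_enrichment, p_depletion


if __name__ == "__main__":
    observed = shared_bases([(0, 10), (20, 30)], [(5, 25)])
    # Placeholder null-model moments, chosen only for illustration.
    z, p_enr, p_dep = normal_p_values(observed, mean=6.0, variance=4.0)
    print(f"observed={observed} z={z:.2f} "
          f"p_enrichment={p_enr:.4f} p_depletion={p_dep:.4f}")
```

A small enrichment p-value would indicate that the observed overlap is larger than expected for unrelated annotations, and a small depletion p-value the opposite. How well the approximation holds depends on how close the statistic's null distribution is to normal.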


Source journal

Journal of Computational Biology (Biology / Computer Science: Interdisciplinary Applications)

CiteScore: 3.60

Self-citation rate: 5.90%

Articles published: 113

Review time: 6-12 weeks

About the journal: Journal of Computational Biology is the leading peer-reviewed journal in computational biology and bioinformatics, publishing in-depth statistical, mathematical, and computational analysis of methods, as well as their practical impact. Available only online, this is an essential journal for scientists and students who want to keep abreast of developments in bioinformatics. Journal of Computational Biology coverage includes:
- Genomics
- Mathematical modeling and simulation
- Distributed and parallel biological computing
- Designing biological databases
- Pattern matching and pattern detection
- Linking disparate databases and data
- New tools for computational biology
- Relational and object-oriented database technology for bioinformatics
- Biological expert system design and use
- Reasoning by analogy, hypothesis formation, and testing by machine
- Management of biological databases