Efficient Analysis of Annotation Colocalization Accounting for Genomic Contexts.

Askar Gafurov, Tomáš Vinař, Paul Medvedev, Broňa Brejová
Published in: Research in computational molecular biology : ... Annual International Conference, RECOMB ... : proceedings. RECOMB (Conference : 2005- ), volume 14758, pages 38-53. Publication date: 2024-04-01 (Epub 2024-05-17).
DOI: https://doi.org/10.1007/978-1-0716-3989-4_3
Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12037170/pdf/

Abstract

An annotation is a set of genomic intervals sharing a particular function or property. Examples include genes, conserved elements, and epigenetic modifications. A common task is to compare two annotations to determine if one is enriched or depleted in the regions covered by the other. We study the problem of assigning statistical significance to such a comparison based on a null model representing two random unrelated annotations. To incorporate more background information into such analyses and avoid biased results, we propose a new null model based on a Markov chain which differentiates among several genomic contexts. These contexts can capture various confounding factors, such as GC content or sequencing gaps. We then develop a new algorithm for estimating p-values by computing the exact expectation and variance of the test statistic and then estimating the p-value using a normal approximation. Compared to the previous algorithm by Gafurov et al., the new algorithm provides three advances: (1) the running time is improved from quadratic to linear or quasi-linear, (2) the algorithm can handle two different test statistics, and (3) the algorithm can handle both simple and context-dependent Markov chain null models. We demonstrate the efficiency and accuracy of our algorithm on synthetic and real data sets, including the recent human telomere-to-telomere assembly. In particular, our algorithm computed p-values for 450 pairs of human genome annotations using 24 threads in under three hours. The use of genomic contexts to correct for GC-bias also resulted in the reversal of some previously published findings.
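
The final step described in the abstract, turning an observed statistic plus its null expectation and variance into a p-value through a normal approximation, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the overlap-count statistic, the function names `overlap_count` and `normal_p_values`, and the numeric mean and variance are all assumptions made for the example. In the paper the expectation and variance are computed exactly under the Markov chain null model; here they are simply passed in as inputs.

```python
import math


def overlap_count(ref, query):
    """Count query intervals that overlap at least one reference interval.

    Assumed statistic, for illustration only. Intervals are half-open
    (start, end) pairs; each list must be sorted and non-overlapping.
    """
    count = 0
    i = 0
    for q_start, q_end in query:
        # Skip reference intervals that end before this query interval starts.
        while i < len(ref) and ref[i][1] <= q_start:
            i += 1
        # The current reference interval ends after q_start, so the two
        # overlap exactly when it also starts before q_end.
        if i < len(ref) and ref[i][0] < q_end:
            count += 1
    return count


def normal_p_values(observed, mean, variance):
    """Return (z, p_enrichment, p_depletion) under N(mean, variance)."""
    if variance <= 0:
        raise ValueError("variance must be positive")
    z = (observed - mean) / math.sqrt(variance)
    # Upper and lower tails of the standard normal, written with erfc.
    p_enrichment = 0.5 * math.erfc(z / math.sqrt(2))
    p_depletion = 0.5 * math.erfc(-z / math.sqrt(2))
    return z, p_enrichment, p_depletion


if __name__ == "__main__":
    ref = [(0, 10), (20, 30), (50, 60)]
    query = [(5, 8), (12, 18), (25, 55), (60, 70)]
    k = overlap_count(ref, query)
    # Hypothetical null moments; a real analysis would derive them
    # from the fitted null model.
    z, p_hi, p_lo = normal_p_values(k, mean=1.0, variance=0.5)
    print(k, z, p_hi, p_lo)
```

A small upper-tail value suggests enrichment and a small lower-tail value suggests depletion. Because the moments are supplied directly, no sampling from the null model is needed at this step.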
