Benchmarking and building DNA binding affinity models using allele-specific and allele-agnostic transcription factor binding data

IF 10.1 | Zone 1 (Biology) | Q1 BIOTECHNOLOGY & APPLIED MICROBIOLOGY

Genome Biology | Pub Date: 2024-10-31 | DOI: 10.1186/s13059-024-03424-2

Xiaoting Li, Lucas A. N. Melo, Harmen J. Bussemaker


Citations: 0


Benchmarking and building DNA binding affinity models using allele-specific and allele-agnostic transcription factor binding data

Transcription factors (TFs) bind to DNA in a highly sequence-specific manner. This specificity manifests itself in vivo as differences in TF occupancy between the two alleles at heterozygous loci. Genome-scale assays such as ChIP-seq currently are limited in their power to detect allele-specific binding (ASB) both in terms of read coverage and representation of individual variants in the cell lines used. This makes prediction of allelic differences in TF binding from sequence alone desirable, provided that the reliability of such predictions can be quantitatively assessed. We here propose methods for benchmarking sequence-to-affinity models for TF binding in terms of their ability to predict allelic imbalances in ChIP-seq counts. We use a likelihood function based on an over-dispersed binomial distribution to aggregate evidence for allelic preference across the genome without requiring statistical significance for individual variants. This allows us to systematically compare predictive performance when multiple binding models for the same TF are available. To facilitate the de novo inference of high-quality models from paired-end in vivo binding data such as ChIP-seq, ChIP-exo, and CUT&Tag without read mapping or peak calling, we introduce an extensible reimplementation of our biophysically interpretable machine learning framework named PyProBound. Explicitly accounting for assay-specific bias in DNA fragmentation rate when training on ChIP-seq yields improved TF binding models. Moreover, we show how PyProBound can leverage our threshold-free ASB likelihood function to perform de novo motif discovery using allele-specific ChIP-seq counts. Our work provides new strategies for predicting the functional impact of non-coding variants.
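The idea of aggregating allelic evidence with an over-dispersed binomial likelihood can be illustrated with a minimal sketch. It is not the authors' implementation and does not use the PyProBound API. It assumes a beta-binomial as the over-dispersed binomial, a hypothetical link `predicted_ref_fraction` that turns two model affinities into an expected reference-read fraction, and a fixed, arbitrarily chosen `concentration` parameter; the read counts and affinities are made-up toy values.

```python
import math


def log_beta(a: float, b: float) -> float:
    """Natural log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_binomial_logpmf(k: int, n: int, p: float, concentration: float) -> float:
    """Log-probability of k reference reads out of n under a beta-binomial.

    The mean fraction is p; a smaller `concentration` means more over-dispersion.
    """
    alpha = p * concentration
    beta = (1.0 - p) * concentration
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta)


def predicted_ref_fraction(affinity_ref: float, affinity_alt: float) -> float:
    """Hypothetical link (an assumption): expected reference-read fraction."""
    return affinity_ref / (affinity_ref + affinity_alt)


def asb_log_likelihood(variants, concentration: float = 20.0) -> float:
    """Sum log-likelihoods over all variants, with no per-variant significance filter."""
    total = 0.0
    for ref_count, alt_count, aff_ref, aff_alt in variants:
        p = predicted_ref_fraction(aff_ref, aff_alt)
        total += beta_binomial_logpmf(ref_count, ref_count + alt_count, p, concentration)
    return total


# Toy data (made up): (ref reads, alt reads, model affinity for ref, for alt)
variants = [(12, 4, 3.0, 1.0), (3, 9, 1.0, 2.5), (7, 6, 1.0, 1.0)]
# Null model: both alleles are assigned equal affinity at every variant
null_variants = [(r, a, 1.0, 1.0) for r, a, _, _ in variants]

model_ll = asb_log_likelihood(variants)
null_ll = asb_log_likelihood(null_variants)
print(f"model: {model_ll:.3f}  null: {null_ll:.3f}")
```

Because every heterozygous variant contributes to the sum, competing binding models for the same TF can be ranked by their total log-likelihood on the same set of counts, even when no single variant would pass a significance test on its own.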


Source journal

Genome Biology (Biochemistry, Genetics and Molecular Biology: Genetics)

CiteScore: 21.00

Self-citation rate: 3.30%

Articles published: 241

Review time: 2 months

About the journal: Genome Biology stands as a premier platform for exceptional research across all domains of biology and biomedicine, explored through a genomic and post-genomic lens. With an impact factor of 12.3 (2022), the journal is the 3rd-ranked research journal in the Genetics and Heredity category and the 2nd-ranked research journal in the Biotechnology and Applied Microbiology category by Thomson Reuters. Notably, Genome Biology holds the distinction of being the highest-ranked open-access journal in this category. Our dedicated team of highly trained in-house Editors collaborates closely with our esteemed Editorial Board of international experts, ensuring the journal remains at the forefront of scientific advances and community standards. Regular engagement with researchers at conferences and institute visits underscores our commitment to staying abreast of the latest developments in the field.