Learning position weight matrices from sequence and expression data.

Xin Chen, Lingqiong Guo, Zhaocheng Fan, Tao Jiang
{"title":"Learning position weight matrices from sequence and expression data.","authors":"Xin Chen, Lingqiong Guo, Zhaocheng Fan, Tao Jiang","doi":"10.1142/9781860948732_0027","DOIUrl":null,"url":null,"abstract":"Position weight matrices (PWMs) are widely used to depict the DNA binding preferences of transcription factors (TFs) in computational molecular biology and regulatory genomics. Thus, learning an accurate PWM to characterize the binding sites of a specific TF is a fundamental problem that plays an important role in modeling regulatory motifs and discovering the binding targets of TFs. Given a set of binding sites bound by a TF, the learning problem can be formulated as a straightforward maximum likelihood problem, namely, finding a PWM such that the likelihood of the observed binding sites is maximized, and is usually solved by counting the base frequencies at each position of the aligned binding sequences. In this paper, we study the question of accurately learning a PWM from both binding site sequences and gene expression (or ChIP-chip) data. We revise the above maximum likelihood framework by taking into account the given gene expression or ChIP-chip data. More specifically, we attempt to find a PWM such that the likelihood of simultaneously observing both the binding sequences and the associated gene expression (or ChIP-chip) values is maximized, by using the sequence weighting scheme introduced in our recent work. We have incorporated this new approach for estimating PWMs into the popular motif finding program AlignACE. The modified program, called W-AlignACE, is compared with three other programs (AlignACE, MDscan, and MotifRegressor) on a variety of datasets, including simulated data, publicly available mRNA expression data, and ChIP-chip data. These large-scale tests demonstrate that W-AlignACE is an effective tool for discovering TF binding sites from gene expression or ChIP-chip data and, in particular, has the ability to find very weak motifs.","PeriodicalId":72665,"journal":{"name":"Computational systems bioinformatics. Computational Systems Bioinformatics Conference","volume":"6 1","pages":"249-60"},"PeriodicalIF":0.0000,"publicationDate":"2007-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"13","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computational systems bioinformatics. Computational Systems Bioinformatics Conference","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1142/9781860948732_0027","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 13

Abstract

Position weight matrices (PWMs) are widely used to depict the DNA binding preferences of transcription factors (TFs) in computational molecular biology and regulatory genomics. Thus, learning an accurate PWM to characterize the binding sites of a specific TF is a fundamental problem that plays an important role in modeling regulatory motifs and discovering the binding targets of TFs. Given a set of binding sites bound by a TF, the learning problem can be formulated as a straightforward maximum likelihood problem, namely, finding a PWM such that the likelihood of the observed binding sites is maximized, and is usually solved by counting the base frequencies at each position of the aligned binding sequences. In this paper, we study the question of accurately learning a PWM from both binding site sequences and gene expression (or ChIP-chip) data. We revise the above maximum likelihood framework by taking into account the given gene expression or ChIP-chip data. More specifically, we attempt to find a PWM such that the likelihood of simultaneously observing both the binding sequences and the associated gene expression (or ChIP-chip) values is maximized, by using the sequence weighting scheme introduced in our recent work. We have incorporated this new approach for estimating PWMs into the popular motif finding program AlignACE. The modified program, called W-AlignACE, is compared with three other programs (AlignACE, MDscan, and MotifRegressor) on a variety of datasets, including simulated data, publicly available mRNA expression data, and ChIP-chip data. These large-scale tests demonstrate that W-AlignACE is an effective tool for discovering TF binding sites from gene expression or ChIP-chip data and, in particular, has the ability to find very weak motifs.
从序列和表达式数据中学习位置权重矩阵。
位置权重矩阵(PWMs)在计算分子生物学和调控基因组学中被广泛用于描述转录因子(TFs)的DNA结合偏好。因此,学习精确的PWM来表征特定TF的结合位点是一个基础问题,在建模调节基序和发现TF的结合靶点中起着重要作用。给定一组由TF结合的结合位点,学习问题可以表述为一个简单的最大似然问题,即找到一个使观察到的结合位点的似然最大化的PWM,通常通过计算对齐的结合序列的每个位置的基频来解决。在本文中,我们研究了从结合位点序列和基因表达(或ChIP-chip)数据中准确学习PWM的问题。我们通过考虑给定的基因表达或ChIP-chip数据来修改上述最大似然框架。更具体地说,我们试图通过使用我们最近工作中引入的序列加权方案,找到一个PWM,使同时观察结合序列和相关基因表达(或ChIP-chip)值的可能性最大化。我们已经将这种估算pwm的新方法整合到流行的motif查找程序AlignACE中。修改后的程序,称为W-AlignACE,在各种数据集上与其他三个程序(AlignACE, MDscan和MotifRegressor)进行比较,包括模拟数据,公开可用的mRNA表达数据和ChIP-chip数据。这些大规模试验表明,W-AlignACE是一种从基因表达或ChIP-chip数据中发现TF结合位点的有效工具,特别是能够发现非常弱的基序。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信