Mining for putative regulatory elements in the yeast genome using gene expression data.

J Vilo, A Brazma, I Jonassen, A Robinson, E Ukkonen
{"title":"Mining for putative regulatory elements in the yeast genome using gene expression data.","authors":"J Vilo,&nbsp;A Brazma,&nbsp;I Jonassen,&nbsp;A Robinson,&nbsp;E Ukkonen","doi":"","DOIUrl":null,"url":null,"abstract":"<p><p>We have developed a set of methods and tools for automatic discovery of putative regulatory signals in genome sequences. The analysis pipeline consists of gene expression data clustering, sequence pattern discovery from upstream sequences of genes, a control experiment for pattern significance threshold limit detection, selection of interesting patterns, grouping of these patterns, representing the pattern groups in a concise form and evaluating the discovered putative signals against existing databases of regulatory signals. The pattern discovery is computationally the most expensive and crucial step. Our tool performs a rapid exhaustive search for a priori unknown statistically significant sequence patterns of unrestricted length. The statistical significance is determined for a set of sequences in each cluster with respect to a set of background sequences allowing the detection of subtle regulatory signals specific for each cluster. The potentially large number of significant patterns is reduced to a small number of groups by clustering them by mutual similarity. Automatically derived consensus patterns of these groups represent the results in a comprehensive way for a human investigator. We have performed a systematic analysis for the yeast Saccharomyces cerevisiae. We created a large number of independent clusterings of expression data simultaneously assessing the \"goodness\" of each cluster. For each of the over 52,000 clusters acquired in this way we discovered significant patterns in the upstream sequences of respective genes. We selected nearly 1,500 significant patterns by formal criteria and matched them against the experimentally mapped transcription factor binding sites in the SCPD database. We clustered the 1,500 patterns to 62 groups for which we derived automatically alignments and consensus patterns. Of these 62 groups 48 had patterns that have matching sites in SCPD database.</p>","PeriodicalId":79420,"journal":{"name":"Proceedings. International Conference on Intelligent Systems for Molecular Biology","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2000-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings. International Conference on Intelligent Systems for Molecular Biology","FirstCategoryId":"1085","ListUrlMain":"","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

We have developed a set of methods and tools for automatic discovery of putative regulatory signals in genome sequences. The analysis pipeline consists of gene expression data clustering, sequence pattern discovery from upstream sequences of genes, a control experiment for pattern significance threshold limit detection, selection of interesting patterns, grouping of these patterns, representing the pattern groups in a concise form and evaluating the discovered putative signals against existing databases of regulatory signals. The pattern discovery is computationally the most expensive and crucial step. Our tool performs a rapid exhaustive search for a priori unknown statistically significant sequence patterns of unrestricted length. The statistical significance is determined for a set of sequences in each cluster with respect to a set of background sequences allowing the detection of subtle regulatory signals specific for each cluster. The potentially large number of significant patterns is reduced to a small number of groups by clustering them by mutual similarity. Automatically derived consensus patterns of these groups represent the results in a comprehensive way for a human investigator. We have performed a systematic analysis for the yeast Saccharomyces cerevisiae. We created a large number of independent clusterings of expression data simultaneously assessing the "goodness" of each cluster. For each of the over 52,000 clusters acquired in this way we discovered significant patterns in the upstream sequences of respective genes. We selected nearly 1,500 significant patterns by formal criteria and matched them against the experimentally mapped transcription factor binding sites in the SCPD database. We clustered the 1,500 patterns to 62 groups for which we derived automatically alignments and consensus patterns. Of these 62 groups 48 had patterns that have matching sites in SCPD database.

利用基因表达数据挖掘酵母基因组中可能的调控元件。
我们已经开发了一套方法和工具来自动发现基因组序列中假定的调节信号。分析管道包括基因表达数据聚类、从上游基因序列中发现序列模式、模式显著性阈值限制检测的控制实验、有趣模式的选择、这些模式的分组、以简洁的形式表示模式组以及根据现有的调控信号数据库评估发现的假设信号。模式发现在计算上是最昂贵和最关键的一步。我们的工具执行一个快速穷尽搜索先验未知的统计显著序列模式的无限制长度。相对于一组背景序列,确定每个簇中的一组序列的统计显著性,允许检测特定于每个簇的细微调节信号。通过相互相似性聚类,将潜在的大量重要模式减少到少量组。这些群体的自动衍生的共识模式代表了人类研究者以全面的方式得出的结果。我们对酿酒酵母进行了系统的分析。我们创建了大量独立的表达数据聚类,同时评估每个聚类的“好”度。对于以这种方式获得的52,000多个集群中的每一个,我们在各自基因的上游序列中发现了重要的模式。我们根据正式标准选择了近1500个显著模式,并将它们与SCPD数据库中实验绘制的转录因子结合位点进行匹配。我们将1500个模式聚集到62个组中,并为其自动导出对齐和一致模式。在这62组中,48组的模式在SCPD数据库中有匹配的位点。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信