cis-Regulatory element prediction in mammalian genomes

2005 IEEE Computational Systems Bioinformatics Conference - Workshops (CSBW'05). Pub Date: 2005-08-08. DOI: 10.1109/CSBW.2005.35

A. Siddiqui, Gordon Robertson, M. Bilenky, T. Astakhova, O. Griffith, M. Hassel, Keven Lin, S. Montgomery, M. Oveisi, E. Pleasance, Neil Robertson, M. Sleumer, Kevin Teague, R. Varhol, Maggie Zhang, Steven J. M. Jones


Citations: 3

Abstract

The identification of cis-regulatory elements and modules is an important step in understanding the regulation of genes. We have developed a pipeline capable of running multiple motif prediction methods on a whole-genome scale. Using gene expression datasets to identify co-expressed genes and the Ensembl Compara database to identify orthologues, we assemble input sequence sets comprising the upstream regions of a target gene, its orthologues and its co-expressed genes, on the premise that such genes will share promoters by evolution (orthologues) or share regulatory control mechanisms (co-expressed genes). Co-expressed genes are identified by an approach that combines Pearson distances from multiple gene expression datasets, derived from multiple experimental approaches, and is calibrated against the GO database. Our pipeline runs a number of established motif detection algorithms with a range of parameter settings on each input sequence set. We integrate the diverse result sets by scoring motifs with a method-independent function. For each target gene, we assign p-values to the motif scores by running the discovery pipeline on multiple sets of input sequences containing the target gene, non-co-expressed genes and "fake" orthologues generated by neutral numerical evolution. We have predicted 30,636 motif binding sites in human for 4,182 genes and an initial set of 472 motif binding sites in mouse for 92 genes with p<0.001. The positive predictive value against a library of biologically confirmed regulatory sites approaches 0.4 at the highest p-value threshold. Predicted regulatory elements and other resources from the project are available at www.cisred.org.
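Two of the steps in the abstract can be illustrated with a minimal sketch: combining Pearson distances across expression datasets, and assigning a p-value to a motif score from scores obtained on background sequence sets. The abstract does not give the actual formulas, so everything here is an assumption made for illustration. The function names are hypothetical, the combination is a plain unweighted mean (the GO-based calibration is not modelled), and the p-value is a standard empirical estimate with a +1 correction.

```python
from math import sqrt
from typing import Sequence


def pearson_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """Return 1 - r, where r is the Pearson correlation of two expression profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have equal length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return 1.0 - sxy / sqrt(sxx * syy)


def combined_distance(
    profiles_a: Sequence[Sequence[float]],
    profiles_b: Sequence[Sequence[float]],
) -> float:
    """Average the Pearson distance of two genes over several datasets.

    Assumption: an unweighted mean. The paper calibrates its combination
    against the GO database, which this sketch does not attempt.
    """
    if len(profiles_a) != len(profiles_b) or not profiles_a:
        raise ValueError("need one profile per dataset for both genes")
    distances = [pearson_distance(a, b) for a, b in zip(profiles_a, profiles_b)]
    return sum(distances) / len(distances)


def empirical_p_value(observed: float, null_scores: Sequence[float]) -> float:
    """Estimate a p-value as the share of background scores >= the observed score.

    Assumption: null_scores come from runs on background sequence sets
    (target gene plus non-co-expressed genes and simulated orthologues).
    The +1 correction keeps the estimate strictly above zero.
    """
    exceed = sum(1 for s in null_scores if s >= observed)
    return (exceed + 1) / (len(null_scores) + 1)
```

With this estimator, reaching p<0.001 requires at least 1,000 background scores, none of which equals or exceeds the observed score. That shows why the number of background runs limits the attainable significance.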
