PromFD 1.0: a computer program that predicts eukaryotic pol II promoters using strings and IMD matrices.

Q K Chen, G Z Hertz, G D Stormo
Journal: Computer applications in the biosciences : CABIOS (Journal Article)
Publication date: 1997-02-01
DOI: 10.1093/bioinformatics/13.1.29 (https://doi.org/10.1093/bioinformatics/13.1.29)
Citation count: 40

Abstract

Motivation: As the Human Genome Project progresses, it generates a large number of new DNA sequences whose functions are virtually unknown. It is therefore essential to develop computer algorithms that can predict the function of DNA segments from their primary sequences, including algorithms that can predict promoters. Although several promoter-predicting algorithms are available, they produce many false positives, and their rate of promoter detection needs further improvement.

Results: PromFD, a computer program that recognizes vertebrate RNA polymerase II promoters, has been developed. Both vertebrate promoter and non-promoter sequences are used in the analysis. The promoters are obtained from the Eukaryotic Promoter Database and the non-promoter sequences from the GenBank sequence databank; each collection is divided into a training set and a test set. The first step is to search, among all possible permutations, for string patterns 5-10 bp long that are significantly over-represented in the promoter set. The program also searches for IMD (Information Matrix Database) matrices that have a significantly higher presence in the promoter set. The results of both searches are stored in the PromFD database, and the program PromFD scores input DNA sequences according to their content of the database entries. PromFD reports the locations of predicted promoters and, if found, the locations of potential TATA boxes. The program detects 71% of the promoters in the training set with a false-positive rate of under 1 in every 13,000 bp, and 47% of the promoters in the test set with a false-positive rate of under 1 in every 9800 bp. PromFD uses a new approach, and its false-positive rate is better than that of other available promoter recognition algorithms. The source code for PromFD is written in C++.
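
The string search and scoring steps described above can be illustrated with a minimal Python sketch. It is not PromFD's implementation: the abstract does not give the significance test or the scoring formula, so the presence ratio with a pseudocount, the `min_ratio` threshold, and the function names `kmer_presence`, `overrepresented`, and `score` are all assumptions made for illustration. The IMD matrix search is not covered.

```python
from collections import Counter


def kmer_presence(seqs, k):
    """Count, for each k-mer, how many sequences contain it at least once."""
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        # A set ensures each sequence contributes at most once per k-mer.
        counts.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return counts


def overrepresented(promoters, background, k_values=range(5, 11), min_ratio=3.0):
    """Return {k-mer: ratio} for strings found far more often in promoters.

    The ratio test, the pseudocount, and min_ratio are illustrative
    assumptions, not the statistic used by PromFD.
    """
    hits = {}
    for k in k_values:
        pos = kmer_presence(promoters, k)
        neg = kmer_presence(background, k)
        for kmer, n_pos in pos.items():
            # Add-one smoothing avoids division by zero for unseen k-mers.
            f_pos = (n_pos + 1) / (len(promoters) + 2)
            f_neg = (neg[kmer] + 1) / (len(background) + 2)
            ratio = f_pos / f_neg
            if ratio >= min_ratio:
                hits[kmer] = ratio
    return hits


def score(seq, hits):
    """Sum the ratios of all stored strings that occur in seq (assumed scoring)."""
    seq = seq.upper()
    return sum(ratio for kmer, ratio in hits.items() if kmer in seq)
```

A caller would build `hits` once from the training sets and then apply `score` to windows of an input sequence; a threshold on that score would set the trade-off between promoters detected and false positives per base pair.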
