Genome-scale prediction of bacterial promoters

Miria Bernardino, R. Beiko
Published in: 2021 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB)
Publication date: 2021-10-13
DOI: https://doi.org/10.1109/CIBCB49929.2021.9562938
Citation count: 2

Abstract

Proteins are responsible for many tasks, including cell growth and metabolism. Transcription, the process in which genes serve as templates for a messenger RNA intermediate used in protein synthesis, is regulated so that the cell responds appropriately to its current needs. An essential step in transcription is the binding of a group of proteins, collectively known as RNA polymerase, to short promoter sequences upstream of the genes to be transcribed. Automated identification of promoters and nearby regulatory sequences can help to predict which genes are likely to be active under a given set of conditions. However, promoters are short, highly variable, and belong to subclasses that sometimes overlap, which makes recognizing them very difficult. Several tools have been developed to identify promoters in DNA, but these methods are generally tested on small, balanced subsets of genomic sequence, and the results may not reflect their expected performance on genomes millions of base pairs long, where only $\sim$1% of the sequence is expected to correspond to promoters. Here we introduce Expositor, a neural-network-based method that uses different types of DNA encodings and tunable sensitivity and specificity parameters. Although the performance of Expositor on balanced datasets was comparable to that of other approaches, at the genome scale our approach finds the most promoters (70% against 46%) with the smallest number of false positives. We also examined the accuracy of Expositor in distinguishing different classes of promoters, and found that misclassification between classes was consistent with the biological similarity between promoters. The Expositor source code and pretrained model, together with the datasets used for training and testing, can be accessed at https://github.com/beiko-lab/Expositor.
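The abstract's point that results on balanced benchmarks may not carry over to whole genomes can be illustrated with a small calculation. The sketch below is not taken from Expositor; the function name and the 90% sensitivity and specificity figures are hypothetical values chosen only for illustration. It computes the fraction of predicted promoters that are real (precision) for one classifier at two different promoter prevalences.

```python
def precision(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Fraction of positive predictions that are true positives.

    prevalence is the fraction of candidate sequences that are real promoters.
    """
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Hypothetical classifier (illustrative values, not results from the paper).
SENSITIVITY = 0.90
SPECIFICITY = 0.90

# Balanced benchmark: half of the test sequences are promoters.
balanced = precision(SENSITIVITY, SPECIFICITY, prevalence=0.50)

# Genome scale: roughly 1% of sequence corresponds to promoters.
genome = precision(SENSITIVITY, SPECIFICITY, prevalence=0.01)

print(f"precision on balanced data: {balanced:.3f}")
print(f"precision at ~1% prevalence: {genome:.3f}")
```

With these assumed values the same classifier goes from 90% precision on balanced data to roughly 8% at 1% prevalence, because false positives from the much larger non-promoter background dominate. This is why tunable sensitivity and specificity matter at the genome scale.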