Predicting gene expression using millions of yeast promoters reveals cis-regulatory logic.

Bioinformatics Advances · IF 2.4 · Q2 (Mathematical & Computational Biology)
Publication date: 2025-06-02 · eCollection date: 2025-01-01 · DOI: 10.1093/bioadv/vbaf130
Tirtharaj Dash, Susanne Bornelöv

Abstract

Motivation: Gene regulation involves complex interactions between transcription factors. While early attempts to predict gene expression were trained using naturally occurring promoters, gigantic parallel reporter assays have vastly expanded potential training data. Despite this, it is still unclear how to best use deep learning to study gene regulation. Here, we investigate the association between promoters and expression using Camformer, a residual convolutional neural network that ranked fourth in the Random Promoter DREAM Challenge 2022. We present the original model trained on 6.7 million sequences and investigate 270 alternative models to find determinants of model performance. Finally, we use explainable AI to uncover regulatory signals.
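Models like Camformer take promoter DNA sequences as input, which are typically one-hot encoded before being fed to a convolutional network. The abstract does not specify Camformer's exact input encoding, so the following is a minimal sketch of the standard approach (the `one_hot_encode` helper and its A/C/G/T channel ordering are illustrative assumptions, not the authors' code):

```python
import numpy as np

def one_hot_encode(seq, alphabet="ACGT"):
    """Encode a DNA sequence as a (channels, length) matrix suitable as
    CNN input. Each column has a single 1 in the row of its base;
    ambiguous bases (e.g. 'N') become all-zero columns."""
    idx = {base: i for i, base in enumerate(alphabet)}
    mat = np.zeros((len(alphabet), len(seq)), dtype=np.float32)
    for j, base in enumerate(seq.upper()):
        if base in idx:
            mat[idx[base], j] = 1.0
    return mat

# A short toy promoter fragment; real inputs would be full-length promoters.
x = one_hot_encode("ACGTN")
```

A batch of such matrices, stacked along a leading dimension, is the usual input tensor for a 1D residual convolutional network of the kind described here.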

Results: Camformer accurately decodes the association between promoters and gene expression (r² = 0.914 ± 0.003, ρ = 0.962 ± 0.002) and provides a substantial improvement over the previous state of the art. Using Grad-CAM and in silico mutagenesis, we demonstrate that our model learns both individual motifs and their hierarchy. For example, while an IME1 motif on its own increases gene expression, a co-occurring UME6 motif instead strongly reduces gene expression. Thus, deep learning models such as Camformer can provide detailed insights into cis-regulatory logic.
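In silico mutagenesis, one of the interpretation methods mentioned above, scores every possible single-base substitution by comparing the model's prediction for the mutant against the wild type. The sketch below illustrates the idea with a hypothetical `score_fn` standing in for a trained model; the `toy_score` motif counter is purely illustrative and not related to the Camformer model or the IME1/UME6 motifs:

```python
import numpy as np

def in_silico_mutagenesis(seq, score_fn, alphabet="ACGT"):
    """Return a (len(seq), len(alphabet)) matrix of predicted effects:
    entry [j, k] is score(seq with base k at position j) - score(seq)."""
    wt = score_fn(seq)
    effects = np.zeros((len(seq), len(alphabet)))
    for j in range(len(seq)):
        for k, base in enumerate(alphabet):
            mutant = seq[:j] + base + seq[j + 1:]
            effects[j, k] = score_fn(mutant) - wt
    return effects

# Toy stand-in for a trained model: counts occurrences of a made-up
# "activator" motif, so destroying the motif yields a negative effect.
def toy_score(seq, motif="TACG"):
    return sum(seq[i:i + len(motif)] == motif
               for i in range(len(seq) - len(motif) + 1))

eff = in_silico_mutagenesis("TACGAA", toy_score)
```

With a real model in place of `toy_score`, large negative entries in the effect matrix highlight bases the model considers important for expression, which is how motif-level effects like the IME1/UME6 interaction can be probed.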

Availability and implementation: Data and code are available at: https://github.com/Bornelov-lab/Camformer.
