Balanced Training Sets Improve Deep Learning-Based Prediction of CRISPR sgRNA Activity

Impact factor 3.7 · CAS Zone 2 (Biology) · JCR Q1 (Biochemical Research Methods)
Varun Trivedi, Amirsadra Mohseni, Stefano Lonardi, and Ian Wheeldon*
ACS Synthetic Biology 2024, 13 (11), 3774–3781. Published 2024-11-04. DOI: 10.1021/acssynbio.4c00542. https://pubs.acs.org/doi/10.1021/acssynbio.4c00542

Abstract

CRISPR-Cas systems have transformed the field of synthetic biology by providing a versatile method for genome editing. The efficiency of CRISPR systems is largely dependent on the sequence of the constituent sgRNA, necessitating the development of computational methods for designing active sgRNAs. While deep learning-based models have shown promise in predicting sgRNA activity, the accuracy of prediction is primarily governed by the data set used in model training. Here, we trained a convolutional neural network (CNN) model and a large language model (LLM) on balanced and imbalanced data sets generated from CRISPR-Cas12a screening data for the yeast Yarrowia lipolytica and evaluated their ability to predict high- and low-activity sgRNAs. We further tested whether prediction performance can be improved by training on imbalanced data sets augmented with synthetic sgRNAs. Lastly, we demonstrated that adding synthetic sgRNAs to inherently imbalanced CRISPR-Cas9 data sets from Y. lipolytica and Komagataella phaffii leads to improved performance in predicting sgRNA activity, thus underscoring the importance of employing balanced training sets for accurate sgRNA activity prediction.
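
To make the balanced versus imbalanced distinction concrete, the following is a minimal sketch of class balancing on a toy data set. It rests on several assumptions: the abstract does not describe how the synthetic sgRNAs were generated, so plain random oversampling (duplicating minority-class guides) is swapped in here as a stand-in and is not the authors' method. The function names `class_counts` and `oversample_to_balance`, the labels "high" and "low", and the sequences are all hypothetical placeholders.

```python
import random
from collections import Counter


def class_counts(labels):
    """Return how many guides fall in each activity class."""
    return Counter(labels)


def oversample_to_balance(guides, labels, seed=0):
    """Duplicate minority-class guides at random until every class is equal in size.

    ASSUMPTION: this is plain random oversampling, used only as a stand-in
    for synthetic-sgRNA augmentation. It is not the method from the paper.
    """
    rng = random.Random(seed)

    # Group guides by their activity label.
    by_class = {}
    for guide, label in zip(guides, labels):
        by_class.setdefault(label, []).append(guide)
    if not by_class:
        return [], []

    # Every class is grown to the size of the largest class.
    target = max(len(members) for members in by_class.values())

    out_guides, out_labels = [], []
    for label, members in by_class.items():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for guide in members + extra:
            out_guides.append(guide)
            out_labels.append(label)
    return out_guides, out_labels


# Made-up placeholder sequences (not real sgRNAs): 6 low-activity, 2 high-activity.
guides = [
    "ACGTACGTAC", "TTGACCGTAA", "GGCATCGATC", "CATGCATGCA",
    "TACGGTCAAT", "AGCTTAGGCA", "GATTACAGAT", "CCGGAATTCC",
]
labels = ["low"] * 6 + ["high"] * 2

balanced_guides, balanced_labels = oversample_to_balance(guides, labels, seed=1)
print("before:", dict(class_counts(labels)))
print("after: ", dict(class_counts(balanced_labels)))
```

Duplicating existing guides adds no new sequence information, so this only illustrates what a balanced training set looks like; any evaluation should still use held-out guides that were never duplicated into the training set.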

Source journal metrics
CiteScore: 8.00
Self-citation rate: 10.60%
Articles published: 380
Review time: 6-12 weeks
Journal scope: The journal is particularly interested in studies on the design and synthesis of new genetic circuits and gene products; computational methods in the design of systems; and integrative applied approaches to understanding disease and metabolism. Topics may include, but are not limited to:
Design and optimization of genetic systems
Genetic circuit design and the principles for organizing circuits into programs
Computational methods to aid the design of genetic systems
Experimental methods to quantify genetic parts, circuits, and metabolic fluxes
Genetic parts libraries: their creation, analysis, and ontological representation
Protein engineering, including computational design
Metabolic engineering and cellular manufacturing, including biomass conversion
Natural product access, engineering, and production
Creative and innovative applications of cellular programming
Medical applications, tissue engineering, and the programming of therapeutic cells
Minimal cell design and construction
Genomics and genome replacement strategies
Viral engineering
Automated and robotic assembly platforms for synthetic biology
DNA synthesis methodologies
Metagenomics and synthetic metagenomic analysis
Bioinformatics applied to gene discovery, chemoinformatics, and pathway construction
Gene optimization
Methods for genome-scale measurements of transcription and metabolomics
Systems biology and methods to integrate multiple data sources
In vitro and cell-free synthetic biology and molecular programming
Nucleic acid engineering