Prediction of Protein Expression and Growth Rates by Supervised Machine Learning

Simiao Zhao
{"title":"Prediction of Protein Expression and Growth Rates by Supervised Machine Learning","authors":"Simiao Zhao","doi":"10.4236/ns.2021.138025","DOIUrl":null,"url":null,"abstract":"The DNA sequences of an organism play an important influence on its transcription and translation process, thus affecting its protein production and growth rate. Due to the com-plexity of DNA, it was extremely difficult to predict the macroscopic characteristics of or-ganisms. However, with the rapid development of machine learning in recent years, it be-comes possible to use powerful machine learning algorithms to process and analyze biolog-ical data. Based on the synthetic DNA sequences of a specific microbe, E. coli, I designed a process to predict its protein production and growth rate. By observing the properties of a data set constructed by previous work, I chose to use supervised learning regressors with encoded DNA sequences as input features to perform the predictions. After comparing different encoders and algorithms, I selected three encoders to encode the DNA sequences as inputs and trained seven different regressors to predict the outputs. The hy-per-parameters are optimized for three regressors which have the best potential prediction performance. Finally, I successfully predicted the protein production and growth rates, with the best R2 score 0.55 and 0.77, respectively, by using encoders to catch the potential fea-tures from the DNA sequences.","PeriodicalId":19083,"journal":{"name":"Natural Science","volume":"15 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2021-08-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Natural Science","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.4236/ns.2021.138025","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1

Abstract

The DNA sequences of an organism play an important influence on its transcription and translation process, thus affecting its protein production and growth rate. Due to the com-plexity of DNA, it was extremely difficult to predict the macroscopic characteristics of or-ganisms. However, with the rapid development of machine learning in recent years, it be-comes possible to use powerful machine learning algorithms to process and analyze biolog-ical data. Based on the synthetic DNA sequences of a specific microbe, E. coli, I designed a process to predict its protein production and growth rate. By observing the properties of a data set constructed by previous work, I chose to use supervised learning regressors with encoded DNA sequences as input features to perform the predictions. After comparing different encoders and algorithms, I selected three encoders to encode the DNA sequences as inputs and trained seven different regressors to predict the outputs. The hy-per-parameters are optimized for three regressors which have the best potential prediction performance. Finally, I successfully predicted the protein production and growth rates, with the best R2 score 0.55 and 0.77, respectively, by using encoders to catch the potential fea-tures from the DNA sequences.
用监督机器学习预测蛋白质表达和生长速度
生物体的DNA序列对其转录和翻译过程有重要影响,从而影响其蛋白质的产生和生长速度。由于DNA的复杂性,预测生物的宏观特征是极其困难的。然而,随着近年来机器学习的快速发展,使用强大的机器学习算法来处理和分析生物数据成为可能。基于一种特定微生物——大肠杆菌的合成DNA序列,我设计了一个过程来预测它的蛋白质产量和生长速度。通过观察先前工作构建的数据集的属性,我选择使用带有编码DNA序列的监督学习回归器作为输入特征来执行预测。在比较了不同的编码器和算法后,我选择了三个编码器来编码DNA序列作为输入,并训练了七个不同的回归器来预测输出。对具有最佳预测潜力的三个回归量进行了超参数优化。最后,我利用编码器捕捉DNA序列的潜在特征,成功预测了蛋白质产量和生长率,R2得分最高,分别为0.55和0.77。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信