PLM_Sol：利用更新的大肠杆菌蛋白质溶解度数据集对多种蛋白质语言模型进行基准测试，从而预测蛋白质的溶解度。

IF 7.7 2区生物学 Q1 BIOCHEMICAL RESEARCH METHODS

Briefings in bioinformatics Pub Date : 2024-07-25 DOI:10.1093/bib/bbae404

Xuechun Zhang, Xiaoxuan Hu, Tongtong Zhang, Ling Yang, Chunhong Liu, Ning Xu, Haoyi Wang, Wen Sun

{"title":"PLM_Sol：利用更新的大肠杆菌蛋白质溶解度数据集对多种蛋白质语言模型进行基准测试，从而预测蛋白质的溶解度。","authors":"Xuechun Zhang, Xiaoxuan Hu, Tongtong Zhang, Ling Yang, Chunhong Liu, Ning Xu, Haoyi Wang, Wen Sun","doi":"10.1093/bib/bbae404","DOIUrl":null,"url":null,"abstract":"Protein solubility plays a crucial role in various biotechnological, industrial, and biomedical applications. With the reduction in sequencing and gene synthesis costs, the adoption of high-throughput experimental screening coupled with tailored bioinformatic prediction has witnessed a rapidly growing trend for the development of novel functional enzymes of interest (EOI). High protein solubility rates are essential in this process and accurate prediction of solubility is a challenging task. As deep learning technology continues to evolve, attention-based protein language models (PLMs) can extract intrinsic information from protein sequences to a greater extent. Leveraging these models along with the increasing availability of protein solubility data inferred from structural database like the Protein Data Bank holds great potential to enhance the prediction of protein solubility. In this study, we curated an Updated Escherichia coli protein Solubility DataSet (UESolDS) and employed a combination of multiple PLMs and classification layers to predict protein solubility. The resulting best-performing model, named Protein Language Model-based protein Solubility prediction model (PLM_Sol), demonstrated significant improvements over previous reported models, achieving a notable 6.4% increase in accuracy, 9.0% increase in F1_score, and 11.1% increase in Matthews correlation coefficient score on the independent test set. Moreover, additional evaluation utilizing our in-house synthesized protein resource as test data, encompassing diverse types of enzymes, also showcased the good performance of PLM_Sol. Overall, PLM_Sol exhibited consistent and promising performance across both independent test set and experimental set, thereby making it well suited for facilitating large-scale EOI studies. PLM_Sol is available as a standalone program and as an easy-to-use model at https://zenodo.org/doi/10.5281/zenodo.10675340.","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"25 5","pages":""},"PeriodicalIF":7.7000,"publicationDate":"2024-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11343611/pdf/","citationCount":"0","resultStr":"{\"title\":\"PLM_Sol: predicting protein solubility by benchmarking multiple protein language models with the updated Escherichia coli protein solubility dataset.\",\"authors\":\"Xuechun Zhang, Xiaoxuan Hu, Tongtong Zhang, Ling Yang, Chunhong Liu, Ning Xu, Haoyi Wang, Wen Sun\",\"doi\":\"10.1093/bib/bbae404\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Protein solubility plays a crucial role in various biotechnological, industrial, and biomedical applications. With the reduction in sequencing and gene synthesis costs, the adoption of high-throughput experimental screening coupled with tailored bioinformatic prediction has witnessed a rapidly growing trend for the development of novel functional enzymes of interest (EOI). High protein solubility rates are essential in this process and accurate prediction of solubility is a challenging task. As deep learning technology continues to evolve, attention-based protein language models (PLMs) can extract intrinsic information from protein sequences to a greater extent. Leveraging these models along with the increasing availability of protein solubility data inferred from structural database like the Protein Data Bank holds great potential to enhance the prediction of protein solubility. In this study, we curated an Updated Escherichia coli protein Solubility DataSet (UESolDS) and employed a combination of multiple PLMs and classification layers to predict protein solubility. The resulting best-performing model, named Protein Language Model-based protein Solubility prediction model (PLM_Sol), demonstrated significant improvements over previous reported models, achieving a notable 6.4% increase in accuracy, 9.0% increase in F1_score, and 11.1% increase in Matthews correlation coefficient score on the independent test set. Moreover, additional evaluation utilizing our in-house synthesized protein resource as test data, encompassing diverse types of enzymes, also showcased the good performance of PLM_Sol. Overall, PLM_Sol exhibited consistent and promising performance across both independent test set and experimental set, thereby making it well suited for facilitating large-scale EOI studies. PLM_Sol is available as a standalone program and as an easy-to-use model at https://zenodo.org/doi/10.5281/zenodo.10675340.\",\"PeriodicalId\":9209,\"journal\":{\"name\":\"Briefings in bioinformatics\",\"volume\":\"25 5\",\"pages\":\"\"},\"PeriodicalIF\":7.7000,\"publicationDate\":\"2024-07-25\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11343611/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Briefings in bioinformatics\",\"FirstCategoryId\":\"99\",\"ListUrlMain\":\"https://doi.org/10.1093/bib/bbae404\",\"RegionNum\":2,\"RegionCategory\":\"生物学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"BIOCHEMICAL RESEARCH METHODS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Briefings in bioinformatics","FirstCategoryId":"99","ListUrlMain":"https://doi.org/10.1093/bib/bbae404","RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"BIOCHEMICAL RESEARCH METHODS","Score":null,"Total":0}

引用次数: 0

摘要

蛋白质溶解度在各种生物技术、工业和生物医学应用中发挥着至关重要的作用。随着测序和基因合成成本的降低，采用高通量实验筛选加上量身定制的生物信息学预测，开发新型功能性酶（EOI）的趋势迅速增长。在这一过程中，高蛋白质溶解度至关重要，而准确预测溶解度是一项具有挑战性的任务。随着深度学习技术的不断发展，基于注意力的蛋白质语言模型（PLM）可以更大程度地从蛋白质序列中提取内在信息。利用这些模型以及从结构数据库（如蛋白质数据库）中推断出的越来越多的蛋白质溶解度数据，可以大大提高蛋白质溶解度的预测能力。在这项研究中，我们策划了一个更新的大肠杆菌蛋白质溶解度数据集（UESolDS），并采用多种 PLM 和分类层的组合来预测蛋白质的溶解度。最终得出的表现最佳的模型被命名为基于蛋白质语言模型的蛋白质溶解度预测模型（PLM_Sol），与之前报道的模型相比有显著改进，在独立测试集上的准确率显著提高了 6.4%，F1_score 提高了 9.0%，马修斯相关系数提高了 11.1%。此外，利用我们内部合成的蛋白质资源作为测试数据进行的额外评估也显示了 PLM_Sol 的良好性能。总之，PLM_Sol在独立测试集和实验集上都表现出了一致和良好的性能，因此非常适合用于大规模EOI研究。PLM_Sol是一个独立的程序，也是一个易于使用的模型，可在https://zenodo.org/doi/10.5281/zenodo.10675340。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

PLM_Sol: predicting protein solubility by benchmarking multiple protein language models with the updated Escherichia coli protein solubility dataset.

查看原文本刊更多论文

PLM_Sol: predicting protein solubility by benchmarking multiple protein language models with the updated Escherichia coli protein solubility dataset.

Protein solubility plays a crucial role in various biotechnological, industrial, and biomedical applications. With the reduction in sequencing and gene synthesis costs, the adoption of high-throughput experimental screening coupled with tailored bioinformatic prediction has witnessed a rapidly growing trend for the development of novel functional enzymes of interest (EOI). High protein solubility rates are essential in this process and accurate prediction of solubility is a challenging task. As deep learning technology continues to evolve, attention-based protein language models (PLMs) can extract intrinsic information from protein sequences to a greater extent. Leveraging these models along with the increasing availability of protein solubility data inferred from structural database like the Protein Data Bank holds great potential to enhance the prediction of protein solubility. In this study, we curated an Updated Escherichia coli protein Solubility DataSet (UESolDS) and employed a combination of multiple PLMs and classification layers to predict protein solubility. The resulting best-performing model, named Protein Language Model-based protein Solubility prediction model (PLM_Sol), demonstrated significant improvements over previous reported models, achieving a notable 6.4% increase in accuracy, 9.0% increase in F1_score, and 11.1% increase in Matthews correlation coefficient score on the independent test set. Moreover, additional evaluation utilizing our in-house synthesized protein resource as test data, encompassing diverse types of enzymes, also showcased the good performance of PLM_Sol. Overall, PLM_Sol exhibited consistent and promising performance across both independent test set and experimental set, thereby making it well suited for facilitating large-scale EOI studies. PLM_Sol is available as a standalone program and as an easy-to-use model at https://zenodo.org/doi/10.5281/zenodo.10675340.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Briefings in bioinformatics 生物-生化研究方法

CiteScore

13.20

自引率

13.70%

发文量

549

审稿时长

6 months

期刊介绍： Briefings in Bioinformatics is an international journal serving as a platform for researchers and educators in the life sciences. It also appeals to mathematicians, statisticians, and computer scientists applying their expertise to biological challenges. The journal focuses on reviews tailored for users of databases and analytical tools in contemporary genetics, molecular and systems biology. It stands out by offering practical assistance and guidance to non-specialists in computerized methodologies. Covering a wide range from introductory concepts to specific protocols and analyses, the papers address bacterial, plant, fungal, animal, and human data. The journal's detailed subject areas include genetic studies of phenotypes and genotypes, mapping, DNA sequencing, expression profiling, gene expression studies, microarrays, alignment methods, protein profiles and HMMs, lipids, metabolic and signaling pathways, structure determination and function prediction, phylogenetic studies, and education and training.