Pre-trained protein language model sheds new light on the prediction of Arabidopsis protein–protein interactions

IF 4.4 · Zone 2 (Biology) · Q1 BIOCHEMICAL RESEARCH METHODS

Plant Methods Pub Date : 2023-12-07 DOI:10.1186/s13007-023-01119-6

Kewei Zhou, Chenping Lei, Jingyan Zheng, Yan Huang, Ziding Zhang


Citations: 0

Abstract

Protein–protein interactions (PPIs) are involved in many biological processes. Identifying PPIs in the model plant Arabidopsis is therefore important for understanding plant growth and development and, in turn, for supporting basic research on crop improvement. Although many Arabidopsis PPIs have been determined experimentally, the known Arabidopsis interactome is far from complete. Effective machine learning models that learn from existing PPI data to predict unknown Arabidopsis PPIs conveniently and rapidly are therefore still urgently needed. We used a large-scale pre-trained protein language model (pLM), ESM-1b, to convert protein sequences into high-dimensional vectors, which we then used as the input to a multilayer perceptron (MLP). To avoid the performance overestimation that frequently occurs in PPI prediction, we trained and evaluated the predictive model on stringent datasets. The combination of ESM-1b and MLP (i.e., ESMAraPPI) performed more accurately than predictive models built on other pLMs or baseline sequence encoding schemes. In particular, ESMAraPPI yielded an AUPR (area under the precision-recall curve) value of 0.810 on an independent test set in which both proteins of each pair are unseen in the training dataset, suggesting strong generalization and extrapolation ability. Moreover, ESMAraPPI performed better than several state-of-the-art generic or plant-specific PPI predictors. Protein sequence embeddings from the pre-trained model ESM-1b contain rich protein semantic information. Combined with the MLP algorithm, ESM-1b showed excellent performance in predicting Arabidopsis PPIs. We anticipate that the proposed predictive model (ESMAraPPI) can serve as a very competitive tool to accelerate the identification of the Arabidopsis interactome.
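
The evaluation idea in the abstract, testing only on pairs whose two proteins never occur in training and summarizing performance by AUPR, can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: the abstract does not say how the split was built or how AUPR was computed, so `split_unseen_proteins`, `average_precision`, the protein IDs, and the labels and scores below are all hypothetical. Average precision is used here as a common step-wise estimate of AUPR.

```python
import random


def split_unseen_proteins(pairs, test_fraction=0.2, seed=0):
    """Split (protein_a, protein_b, label) records by protein, not by pair.

    A random subset of proteins is held out. Test pairs contain only
    held-out proteins, training pairs contain none of them, and pairs
    that mix the two groups are dropped (an assumption of this sketch).
    """
    proteins = sorted({p for a, b, _ in pairs for p in (a, b)})
    rng = random.Random(seed)
    rng.shuffle(proteins)
    n_test = max(1, int(len(proteins) * test_fraction))
    held_out = set(proteins[:n_test])
    train = [r for r in pairs if r[0] not in held_out and r[1] not in held_out]
    test = [r for r in pairs if r[0] in held_out and r[1] in held_out]
    return train, test


def average_precision(labels, scores):
    """Mean of precision values at the rank of each positive example.

    Ties in score are broken by input order; a production metric would
    handle ties explicitly.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive label")
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            total += hits / rank  # precision at this positive's rank
    return total / n_pos


# Hypothetical toy data: placeholder protein IDs with binary labels.
pairs = [
    ("P1", "P2", 1), ("P3", "P4", 0), ("P5", "P6", 1),
    ("P1", "P3", 0), ("P2", "P5", 1), ("P4", "P6", 0),
]
train, test = split_unseen_proteins(pairs, test_fraction=0.5)
print(len(train), len(test))

# Positives ranked 1st and 3rd: (1/1 + 2/3) / 2 = 5/6
print(average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]))
```

Splitting by protein rather than by pair discards the mixed pairs, which costs data but keeps any protein seen in training from leaking into the test set.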




Source journal

Plant Methods (Biology: Plant Sciences)

CiteScore: 9.20
Self-citation rate: 3.90%
Articles published: 121
Review time: 2 months

Journal description: Plant Methods is an open access, peer-reviewed, online journal for the plant research community that encompasses all aspects of technological innovation in the plant sciences. There is no doubt that we have entered an exciting new era in plant biology. The completion of the Arabidopsis genome sequence, and the rapid progress being made in other plant genomics projects are providing unparalleled opportunities for progress in all areas of plant science. Nevertheless, enormous challenges lie ahead if we are to understand the function of every gene in the genome, and how the individual parts work together to make the whole organism. Achieving these goals will require an unprecedented collaborative effort, combining high-throughput, system-wide technologies with more focused approaches that integrate traditional disciplines such as cell biology, biochemistry and molecular genetics. Technological innovation is probably the most important catalyst for progress in any scientific discipline. Plant Methods’ goal is to stimulate the development and adoption of new and improved techniques and research tools and, where appropriate, to promote consistency of methodologies for better integration of data from different laboratories.