Semi-supervised prediction of protein fitness for data-driven protein engineering

IF 5.7, Zone 2 (Chemistry), Q1 Chemistry, Multidisciplinary

Journal of Cheminformatics Pub Date : 2025-05-31 DOI:10.1186/s13321-025-01029-w

Alicia Olivares-Gil, José A. Barbero-Aparicio, Juan J. Rodríguez, José F. Díez-Pastor, César García-Osorio, Mehdi D. Davari



Abstract

Protein fitness prediction plays a crucial role in advancing protein engineering. However, the combinatorial complexity of protein sequence space and the limited availability of assay-labelled data hinder the efficient optimisation of protein properties. Data-driven strategies based on machine learning have emerged as a promising solution, yet their dependence on labelled training datasets poses a significant obstacle. To overcome this challenge, we explore various ways of introducing the latent information present in evolutionarily related (homologous) sequences into the training process. To do so, we establish several strategies based on semi-supervised learning (unsupervised pre-processing and wrapper methods) and perform a comprehensive comparison on 19 datasets of protein-fitness pairs. Our findings reveal that the information present in homologous sequences can improve model performance, especially when few labelled sequences are available. Specifically, the combination of a sequence encoding based on Direct Coupling Analysis (DCA) with MERGE (a hybrid regression framework that combines evolutionary information with supervised learning) and an SVM regressor outperforms other encodings (PAM250, UniRep, eUniRep) and other semi-supervised wrapper methods (Tri-Training Regressor, Co-Training Regressor). In summary, the demonstrated performance gains of this strategy mark a substantial leap towards more robust and reliable predictive models for protein engineering tasks. This advancement has the potential to streamline the design and optimisation of proteins for diverse applications in biotechnology and therapeutics.

We explore several semi-supervised learning strategies that can incorporate unlabelled sequences homologous to the protein of interest into the training process. Among them, we present two new methods to exploit the information in homologous sequences: (i) a generalised version of MERGE that can use any regressor as its base estimator; (ii) the Tri-Training Regressor, an adaptation of the Tri-Training method to regression problems. We find that the information in homologous sequences can improve the predictive capacity of models when labelled sequences are scarce, especially when the DCA encoding is used together with MERGE and an SVM regressor.
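As a rough illustration of the general idea in the abstract (unlabelled homologous sequences supplying evolutionary information that a supervised model then calibrates against a few labelled variants), the toy sketch below may help. It is not the DCA encoding, MERGE, or the SVM regressor evaluated in the paper: it substitutes a site-independent frequency profile for DCA and a one-variable least-squares fit for the regressor. All sequences, fitness values, and function names are invented for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(homologs, pseudocount=1.0):
    """Per-position log-frequencies from aligned, unlabelled homologs.

    Assumption: a site-independent profile, a much simpler stand-in
    for DCA, which also models pairwise couplings between positions.
    """
    length = len(homologs[0])
    total = len(homologs) + pseudocount * len(AMINO_ACIDS)
    profile = []
    for pos in range(length):
        counts = Counter(seq[pos] for seq in homologs)
        profile.append({aa: math.log((counts[aa] + pseudocount) / total)
                        for aa in AMINO_ACIDS})
    return profile

def evolutionary_score(seq, profile):
    """Sum of per-position log-frequencies; higher means more homolog-like."""
    return sum(profile[pos][aa] for pos, aa in enumerate(seq))

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov_xy / var_x
    return slope, mean_y - slope * mean_x

# Toy data, invented for illustration only.
homologs = ["ACDE", "ACDE", "ACDD", "ACEE", "GCDE"]      # unlabelled
labelled = [("ACDE", 1.0), ("ACDD", 0.6), ("WCDE", 0.1)]  # (sequence, fitness)

# Unsupervised step: learn the profile from homologs alone.
profile = build_profile(homologs)

# Supervised step: map the evolutionary score to measured fitness.
scores = [evolutionary_score(seq, profile) for seq, _ in labelled]
slope, intercept = fit_line(scores, [fitness for _, fitness in labelled])

def predict(seq):
    """Predicted fitness for a sequence of the same aligned length."""
    return slope * evolutionary_score(seq, profile) + intercept
```

Calling `predict("ACEE")` scores a variant that has no fitness label but whose residues appear among the homologs. In this sketch the labelled data only set the scale and offset of the score, which is one reason such hybrids can remain usable when labels are scarce.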




Source journal

Journal of Cheminformatics (Chemistry, Multidisciplinary; Computer Science, Information Systems)

CiteScore: 14.10

Self-citation rate: 7.00%

Review time: 3 months

About the journal: Journal of Cheminformatics is an open access journal publishing original peer-reviewed research in all aspects of cheminformatics and molecular modelling. Coverage includes, but is not limited to: chemical information systems, software and databases; molecular modelling; chemical structure representations and their use in structure, substructure, and similarity searching of chemical substance and chemical reaction databases; computer and molecular graphics; computer-aided molecular design; expert systems; QSAR; and data mining techniques.