DeepAllo: Allosteric Site Prediction using Protein Language Model (pLM) with Multitask Learning.

Bioinformatics (Oxford, England) Pub Date : 2025-05-15 DOI:10.1093/bioinformatics/btaf294

Moaaz Khokhar, Ozlem Keskin, Attila Gursoy



Abstract

Motivation: Allostery, the process by which binding at one site perturbs a distant site, has become a key focus in drug development because of its substantial impact on protein function. Identifying allosteric pockets (sites) is challenging, and several techniques have been developed for the task, including Machine Learning (ML) methods that predict allosteric pockets from both static and pocket features.

Results: Our work, DeepAllo, is the first study to combine a fine-tuned protein language model (pLM) with FPocket features, and it improves allosteric site prediction performance over previous studies. The pLM was fine-tuned on the Allosteric Dataset (ASD) in a Multitask Learning (MTL) setting and then used as a feature extractor to train XGBoost and AutoML models. The best model predicts allosteric pockets with an F1 score of 89.66% and ranks 90.5% of allosteric pockets in the top 3 positions, outperforming previous results. A case study on proteins with known allosteric pockets supports our approach. Moreover, we attempted to explain the pLM by visualizing its attention mechanism among allosteric and non-allosteric residues.
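The two headline metrics can be illustrated with a minimal sketch. It rests on assumptions not stated in the abstract: the F1 score is taken for the allosteric class over per-pocket binary predictions, and the top-3 figure is read as the fraction of proteins for which a labelled allosteric pocket falls among the three highest-scoring pockets. The function names, the 0.5 threshold, and the toy data are hypothetical and are not taken from the DeepAllo code or results.

```python
from collections import defaultdict


def f1_score(y_true, y_pred):
    """F1 for the positive (allosteric) class from binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def top_k_hit_rate(pockets, k=3):
    """Fraction of proteins with a labelled allosteric pocket in the top k.

    `pockets` is a list of (protein_id, score, label) tuples, label 1 = allosteric.
    Proteins without any labelled allosteric pocket are skipped.
    """
    by_protein = defaultdict(list)
    for protein_id, score, label in pockets:
        by_protein[protein_id].append((score, label))

    hits = total = 0
    for entries in by_protein.values():
        if not any(label == 1 for _, label in entries):
            continue
        total += 1
        # Rank this protein's pockets by predicted score, highest first.
        ranked = sorted(entries, key=lambda e: e[0], reverse=True)
        if any(label == 1 for _, label in ranked[:k]):
            hits += 1
    return hits / total if total else 0.0


# Toy data, made up for illustration only (not from the paper).
toy = [
    ("P1", 0.9, 1), ("P1", 0.4, 0), ("P1", 0.2, 0), ("P1", 0.1, 0),
    ("P2", 0.8, 0), ("P2", 0.7, 0), ("P2", 0.6, 0), ("P2", 0.3, 1),
]
labels = [label for _, _, label in toy]
preds = [1 if score >= 0.5 else 0 for _, score, _ in toy]  # assumed threshold

print(f1_score(labels, preds))
print(top_k_hit_rate(toy, k=3))
```

The two numbers measure different things: F1 depends on the classification threshold applied to each pocket, whereas the top-k rate depends only on how pockets are ranked within each protein.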

Availability: The source code is available on GitHub (https://github.com/MoaazK/deepallo) and archived on Zenodo (DOI: 10.5281/zenodo.15255379). The trained model is hosted on Hugging Face (DOI: 10.57967/hf/5198). The dataset used for training and evaluation is archived on Zenodo (DOI: 10.5281/zenodo.15255437).

Supplementary information: Supplementary data, including the full list of proteins used in the study with their PDB IDs, a t-SNE analysis of pocket features, a confusion matrix breakdown, and an interpretation of borderline classification cases, are available as supplementary material alongside this article.


