NeSyDPP-4: discovering DPP-4 inhibitors for diabetes treatment with a neuro-symbolic AI approach.

IF 3.9 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY
Frontiers in bioinformatics Pub Date : 2025-07-21 eCollection Date: 2025-01-01 DOI:10.3389/fbinf.2025.1603133
Delower Hossain, Ehsan Saghapour, Jake Y Chen
Frontiers in Bioinformatics, volume 5, article 1603133. Journal Article. Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12319772/pdf/

Abstract

Introduction: Diabetes mellitus (DM) is a global epidemic and one of the ten leading causes of mortality (WHO, 2019); it is projected to rank seventh by 2030. The US National Diabetes Statistics Report (2021) states that 38.4 million Americans have diabetes. Dipeptidyl peptidase-4 (DPP-4) is an FDA-approved target for the treatment of type 2 diabetes mellitus (T2DM). However, current DPP-4 inhibitors may cause adverse effects, including gastrointestinal issues, severe joint pain (FDA safety warning), nasopharyngitis, hypersensitivity, and nausea. Moreover, developing novel drugs and assessing DPP-4 inhibition in vivo are both costly and often impractical. These challenges highlight the urgent need for efficient in silico approaches to facilitate the discovery and optimization of safer and more effective DPP-4 inhibitors.

Methodology: Quantitative Structure-Activity Relationship (QSAR) modeling is a widely used computational approach for evaluating the properties of chemical substances. In this study, we employed a neuro-symbolic (NeSy) approach, specifically the Logic Tensor Network (LTN), to develop a DPP-4 QSAR model that identifies potential small-molecule inhibitors and classifies their bioactivity. For comparison, we also implemented baseline models using Deep Neural Networks (DNNs) and Transformers. A total of 6,563 bioactivity records (compounds represented as SMILES strings, each with an IC50 value) were collected from ChEMBL, PubChem, BindingDB, and GTP. Feature sets used for model training included descriptors (CDK Extended-PaDEL), fingerprints (Morgan), chemical language model embeddings (ChemBERTa-2), LLaMa 3.2 embedding features, and physicochemical properties.
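
As a rough illustration of the Logic Tensor Network idea mentioned above, the sketch below grounds a single predicate as a differentiable function with truth values in [0, 1] and scores how well two logical axioms ("every known active is an inhibitor", "every known inactive is not") are satisfied. It is a toy in plain Python, not the authors' implementation: the predicate `inhibitor_truth`, the linear-plus-sigmoid grounding, the two-dimensional feature vectors, and the weights are all hypothetical. The p-mean-error aggregator for "for all" and the `1 - x` negation follow the standard Real Logic formulation used by LTNs.

```python
import math


def sigmoid(z):
    """Logistic function, maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def inhibitor_truth(features, weights, bias):
    """Hypothetical predicate Inhibitor(x): truth degree in (0, 1).

    Grounded here as a linear model plus sigmoid (an assumption; a real
    LTN would typically use a neural network over molecular features).
    """
    z = sum(w * f for w, f in zip(weights, features)) + bias
    return sigmoid(z)


def fuzzy_not(truth):
    """Standard fuzzy negation."""
    return 1.0 - truth


def forall(truths, p=2):
    """p-mean-error aggregator, a smooth approximation of 'for all'."""
    mean_err = sum((1.0 - t) ** p for t in truths) / len(truths)
    return 1.0 - mean_err ** (1.0 / p)


def satisfaction(actives, inactives, weights, bias, p=2):
    """Aggregate satisfaction of both axioms over the labelled compounds."""
    sat_pos = forall([inhibitor_truth(x, weights, bias) for x in actives], p)
    sat_neg = forall(
        [fuzzy_not(inhibitor_truth(x, weights, bias)) for x in inactives], p
    )
    return forall([sat_pos, sat_neg], p)


# Toy feature vectors (hypothetical, not real descriptors or fingerprints).
actives = [[1.0, 0.0], [0.9, 0.1]]
inactives = [[0.0, 1.0], [0.1, 0.8]]

good = satisfaction(actives, inactives, weights=[4.0, -4.0], bias=0.0)
bad = satisfaction(actives, inactives, weights=[-4.0, 4.0], bias=0.0)
loss = 1.0 - good  # training would minimise this quantity
print(f"sat(good)={good:.3f} sat(bad)={bad:.3f} loss={loss:.3f}")
```

Training an LTN amounts to adjusting the predicate's parameters so that the aggregate satisfaction approaches 1, which is why the loss is written as one minus the satisfaction level.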

Results: Among all tested configurations, the neuro-symbolic QSAR model (NeSyDPP-4) performed best using a combination of CDK extended and Morgan fingerprints. The model achieved an accuracy of 0.9725, an F1-score of 0.9723, an ROC AUC of 0.9719, and a Matthews correlation coefficient (MCC) of 0.9446. These results outperformed the baseline DNN and Transformer models, as well as existing state-of-the-art (SOTA) methods. To further validate the robustness of the model, we conducted an external evaluation using the Drug Target Commons (DTC) dataset, where NeSyDPP-4 also demonstrated strong performance, with an accuracy of 0.9579, an F1-score of 0.9577, an ROC AUC of 0.9565, and an MCC of 0.9171.
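
For readers less familiar with the four metrics reported above, the following minimal sketch computes them for a binary active/inactive classifier using only the Python standard library. The labels and scores are made-up toy values, not data from the study, and the abstract does not say whether the reported F1-score is binary, macro, or weighted; the binary form is assumed here.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels coded as 0 and 1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def classification_metrics(y_true, y_pred):
    """Accuracy, binary F1 and Matthews correlation coefficient."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "f1": f1, "mcc": mcc}


def roc_auc(y_true, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Assumes both classes are present.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical values).
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.1, 0.3, 0.6]

results = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(results, auc)
```

MCC uses all four cells of the confusion matrix, so it stays informative when the active and inactive classes are imbalanced, which is one reason it is commonly reported alongside accuracy in QSAR work.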

Discussion: These findings suggest that the NeSyDPP-4 model not only delivered high predictive performance but also generalized to an external dataset. This approach presents a cost-effective and reliable alternative to traditional in vivo screening, offering valuable support for the identification and classification of biologically active DPP-4 inhibitors in the treatment of type 2 diabetes mellitus (T2DM).
