GDMol: Generative Double-Masking Self-Supervised Learning for Molecular Property Prediction.

IF 3.1 · CAS Zone 4 (Medicine) · JCR Q3 · CHEMISTRY, MEDICINAL

Molecular Informatics. Pub Date: 2025-01-01. Epub Date: 2024-10-24. DOI: 10.1002/minf.202400146

Yingxu Liu, Qing Fan, Chengcheng Xu, Xiangzhen Ning, Yu Wang, Yang Liu, Yu Xie, Yanmin Zhang, Yadong Chen, Haichun Liu



Abstract

Background: Effective molecular feature representation is crucial for drug property prediction. In recent years, graph neural networks (GNNs) pre-trained with self-supervised learning have attracted growing attention as a way to overcome the scarcity of labeled data in molecular property prediction. Conventional self-supervised GNNs typically apply a single masking operation to the nodes and edges of the input molecular graph; this masks only local information and is insufficient for thorough self-supervised training.

Method: We therefore propose GDMol, a molecular property prediction model based on generative double-masking self-supervised learning. GDMol integrates generative learning into the self-supervised framework to learn latent representations, then applies a second round of masking to those latent representations. This enables the model to better capture global information and semantic knowledge of the molecules, yielding a richer, more informative representation and, in turn, more accurate and robust molecular property prediction.
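The two masking rounds described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the encoder, the mask ratios, or the mask token, so `toy_encoder` (one round of mean aggregation), the 25% ratios, and the all-zero mask vector are assumptions chosen only to make the data flow concrete.

```python
import random


def mask_rows(rows, ratio, rng, mask_value=0.0):
    """Replace a random subset of rows with a constant mask vector.

    Returns a masked copy of `rows` and the sorted indices that were masked.
    """
    n = len(rows)
    k = max(1, round(ratio * n))
    masked_idx = sorted(rng.sample(range(n), k))
    masked = [list(r) for r in rows]  # copy so the input is left untouched
    for i in masked_idx:
        masked[i] = [mask_value] * len(rows[i])
    return masked, masked_idx


def toy_encoder(features, adjacency):
    """Hypothetical stand-in for a GNN encoder: each node's latent vector is
    the mean of its own features and its neighbours' features."""
    latent = []
    for i, neighbours in enumerate(adjacency):
        group = [features[i]] + [features[j] for j in neighbours]
        dim = len(features[i])
        latent.append([sum(v[d] for v in group) / len(group) for d in range(dim)])
    return latent


def double_mask(features, adjacency, input_ratio=0.25, latent_ratio=0.25, seed=0):
    """First mask the input node features, encode, then mask the latents."""
    rng = random.Random(seed)
    masked_input, input_idx = mask_rows(features, input_ratio, rng)
    latent = toy_encoder(masked_input, adjacency)
    masked_latent, latent_idx = mask_rows(latent, latent_ratio, rng)
    return masked_latent, input_idx, latent_idx


# Toy 4-atom chain 0-1-2-3 with two-dimensional one-hot atom features.
features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
adjacency = [[1], [0, 2], [1, 3], [2]]
masked_latent, input_idx, latent_idx = double_mask(features, adjacency)
```

In a generative pre-training setup, a decoder would then be trained to reconstruct the masked content from `masked_latent`; that reconstruction step is omitted here because the abstract gives no details about it.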

Results: Experiments on five datasets demonstrated superior performance of GDMol in predicting molecular properties across different domains. Moreover, we used the masking operation to traverse the gradient changes of each node: the sign of each change indicates whether the corresponding local structure contributes positively or negatively to the prediction, and its magnitude indicates how strongly. This in-depth interpretative analysis not only enhances the model's interpretability but also provides more targeted insights and direction for optimizing drug molecules.
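The idea of reading a signed, per-node contribution off a masking sweep can be sketched as follows. This sketch swaps in a simpler technique than the one the abstract names: it masks one node at a time and records the change in the prediction (an occlusion-style finite difference) instead of tracking gradients. The linear `toy_predict` model and its weights are hypothetical stand-ins for a trained property predictor.

```python
def toy_predict(features, weights):
    """Hypothetical stand-in model: weighted sum over all node features."""
    return sum(w * x for row in features for w, x in zip(weights, row))


def node_contributions(features, weights, mask_value=0.0):
    """Mask each node in turn and record how much the prediction drops.

    A positive score means the node pushed the prediction up; a negative
    score means it pushed the prediction down.
    """
    base = toy_predict(features, weights)
    scores = []
    for i in range(len(features)):
        masked = [list(r) for r in features]
        masked[i] = [mask_value] * len(features[i])
        scores.append(base - toy_predict(masked, weights))
    return scores


# Three nodes with one-hot features; feature 0 is weighted +0.5, feature 1 is -2.0.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
weights = [0.5, -2.0]
contributions = node_contributions(features, weights)
print(contributions)  # [0.5, -2.0, 0.5]
```

Here the middle node has the largest magnitude and a negative sign, so under this toy model it would be the first candidate for modification when trying to raise the predicted value.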

Conclusions: In summary, this research offers novel insights into improving molecular property prediction and paves the way for further research on applying generative learning and self-supervised learning in chemistry.


Source journal

Molecular Informatics (categories: CHEMISTRY, MEDICINAL; MATHEMATICAL & COMPUTATIONAL BIOLOGY)

CiteScore

7.30

Self-citation rate

2.80%

Articles published

Review time

3 months

Journal description: Molecular Informatics is a peer-reviewed, international forum for publication of high-quality, interdisciplinary research on all molecular aspects of bio/cheminformatics and computer-assisted molecular design. Molecular Informatics succeeded QSAR & Combinatorial Science in 2010. Molecular Informatics presents methodological innovations that will lead to a deeper understanding of ligand-receptor interactions, macromolecular complexes, molecular networks, design concepts and processes that demonstrate how ideas and design concepts lead to molecules with a desired structure or function, preferably including experimental validation. The journal's scope includes but is not limited to the fields of drug discovery and chemical biology, protein and nucleic acid engineering and design, the design of nanomolecular structures, strategies for modeling of macromolecular assemblies, molecular networks and systems, pharmaco- and chemogenomics, computer-assisted screening strategies, as well as novel technologies for the de novo design of biologically active molecules. As a unique feature, Molecular Informatics publishes so-called "Methods Corner" review-type articles, which feature important technological concepts and advances within the scope of the journal.