Comparison study of dominant molecular sequence representation based on diffusion model

IF 3.1, Zone 3 (Biology), JCR Q3 (BIOCHEMISTRY & MOLECULAR BIOLOGY)

Journal of Computer-Aided Molecular Design, Pub Date: 2025-07-18, DOI: 10.1007/s10822-025-00614-3

Yongrui Cui, Dongjing Shan, Qiheng Lu, Beijia Zou, Huali Zhang, Jin Li, Jiashun Mao

Volume 39, issue 1. Publication type: Journal Article. Open access: no. Source: https://link.springer.com/article/10.1007/s10822-025-00614-3

Citations: 0

Abstract

In recent years, the emergence of large language models (LLMs), particularly ChatGPT, has made sequence-based representation learning and generative modeling the dominant research paradigm in AI for science. In drug discovery and computational chemistry, compound representation learning and molecular generation are two of the most important tasks. The predominant sequence representations used for molecular characterization and generation are SMILES (Simplified Molecular-Input Line-Entry System), SELFIES (SELF-referencing Embedded Strings), SMARTS (SMILES Arbitrary Target Specification), and IUPAC (International Union of Pure and Applied Chemistry) nomenclature. In AI-assisted drug design, each of these molecular languages has its own strengths and weaknesses, and the granularity of the information they encode varies significantly. Choosing an appropriate molecular representation as the input format for model training is therefore crucial, yet the question has not been thoroughly explored. Diffusion models are currently the state-of-the-art models for molecular generation and optimization, so this study compares the four mainstream molecular representation languages within a single diffusion model used to generate molecular sets. First, each molecule is expressed in all four representations, and a denoising diffusion model is trained on each representation with identical parameters. Thirty thousand molecules are then generated for evaluation and analysis.

The results indicate that the four molecular representation languages exhibit both similarities and differences in attribute distribution and spatial distribution. Notably, SELFIES and SMARTS are highly similar to each other, while IUPAC and SMILES differ substantially. IUPAC’s primary advantage lies in the novelty and diversity of the generated molecules, SMILES excels on the QEPPI and SAscore metrics, and SELFIES and SMARTS perform best on the QED metric. These findings offer guidance for selecting molecular representations in AI drug design tasks and may thereby improve the efficiency of drug development.
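The abstract compares representations on the novelty and diversity of the generated molecules but does not say how those metrics were computed. The following is a minimal sketch of one common way to define them, not the authors' procedure. It treats molecules as raw strings and uses character bigrams as a crude stand-in for molecular fingerprints; both choices, and all function names, are assumptions made for illustration. A real evaluation would normally canonicalize the molecules and use chemical fingerprints from a cheminformatics toolkit.

```python
from itertools import combinations


def uniqueness(generated):
    """Fraction of generated strings that are distinct."""
    return len(set(generated)) / len(generated)


def novelty(generated, training):
    """Fraction of distinct generated strings absent from the training set."""
    unique = set(generated)
    return len(unique - set(training)) / len(unique)


def ngrams(s, n=2):
    """Character n-grams: an illustrative stand-in for a fingerprint (assumption)."""
    return {s[i:i + n] for i in range(len(s) - n + 1)} or {s}


def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two feature sets."""
    return len(a & b) / len(a | b)


def internal_diversity(generated, n=2):
    """1 minus the mean pairwise Tanimoto similarity over distinct strings."""
    feats = [ngrams(s, n) for s in sorted(set(generated))]
    pairs = list(combinations(feats, 2))
    if not pairs:
        return 0.0
    return 1.0 - sum(tanimoto(a, b) for a, b in pairs) / len(pairs)


# Toy data, invented for the example (not taken from the study).
training = ["CCO", "CCN", "c1ccccc1"]
generated = ["CCO", "CCC", "CCC", "c1ccccc1O"]

print(uniqueness(generated))
print(novelty(generated, training))
print(internal_diversity(generated))
```

Because the same functions accept any string representation, the comparison could in principle be repeated per representation, although string-level features are not comparable across SMILES, SELFIES, SMARTS, and IUPAC names without first converting everything to a common molecular form.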


Source journal

Journal of Computer-Aided Molecular Design (Biology; Computer Science: Interdisciplinary Applications)

CiteScore: 8.00

Self-citation rate: 8.60%

Articles published:

Review time: 3 months

Journal description: The Journal of Computer-Aided Molecular Design provides a forum for disseminating information on both the theory and the application of computer-based methods in the analysis and design of molecules. The scope of the journal encompasses papers that report new and original research and applications in the following areas: theoretical chemistry; computational chemistry; computer and molecular graphics; molecular modeling; protein engineering; drug design; expert systems; general structure-property relationships; molecular dynamics; chemical database development and usage.