Yongrui Cui, Dongjing Shan, Qiheng Lu, Beijia Zou, Huali Zhang, Jin Li, Jiashun Mao
{"title":"Comparison study of dominant molecular sequence representation based on diffusion model.","authors":"Yongrui Cui, Dongjing Shan, Qiheng Lu, Beijia Zou, Huali Zhang, Jin Li, Jiashun Mao","doi":"10.1007/s10822-025-00614-3","DOIUrl":null,"url":null,"abstract":"<p><p>In recent years, the emergence of large language models (LLMs), particularly the advent of ChatGPT, has positioned natural language sequence-based representation learning and generative models as the dominant research paradigm in AI for science. Within the domains of drug discovery and computational chemistry, compound representation learning and molecular generation stand out as two of the most significant tasks. Currently, the predominant molecular representation sequences used for molecular characterization and generation include SMILES (Simplified Molecular-Input Line-Entry System), SELFIES (SELF-referencing Embedded Strings), SMARTS (Smiles Arbitrary Target Specification), and IUPAC (International Union of Pure and Applied Chemistry) nomenclature. In the context of AI-assisted drug design, each of these molecular languages has its own strengths and weaknesses, and the granularity of information encoded by different molecular representation forms varies significantly. However, the selection of an appropriate molecular representation as the input format for model training is crucial, yet this issue has not been thoroughly explored. Furthermore, the state-of-the-art models currently employed for molecular generation and optimization are diffusion models. Therefore, this study investigates the characteristics of the four mainstream molecular representation languages within the same diffusion model for training generative molecular sets. First, a single molecule is represented in four different ways through varying methodologies, followed by training a denoising diffusion model using identical parameters. Subsequently, thirty thousand molecules are generated for evaluation and analysis. The results indicate that the four molecular representation languages exhibit both similarities and differences in attribute distribution and spatial distribution; notably, SELFIES and SMARTS demonstrate a high degree of similarity, while IUPAC and SMILES show substantial differences. Additionally, IUPAC's primary advantage lies in the novelty and diversity of generated molecules, whereas SMILES excels in QEPPI and SAscore metrics, with SELFIES and SMARTS performing best on the QED metric. The findings of this research will provide crucial insights into the selection of molecular representations in AI drug design tasks, thereby contributing to enhanced efficiency in drug development.</p>","PeriodicalId":621,"journal":{"name":"Journal of Computer-Aided Molecular Design","volume":"39 1","pages":"54"},"PeriodicalIF":3.1000,"publicationDate":"2025-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Computer-Aided Molecular Design","FirstCategoryId":"99","ListUrlMain":"https://doi.org/10.1007/s10822-025-00614-3","RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"BIOCHEMISTRY & MOLECULAR BIOLOGY","Score":null,"Total":0}
引用次数: 0
Abstract
In recent years, the emergence of large language models (LLMs), particularly the advent of ChatGPT, has positioned natural language sequence-based representation learning and generative models as the dominant research paradigm in AI for science. Within the domains of drug discovery and computational chemistry, compound representation learning and molecular generation stand out as two of the most significant tasks. Currently, the predominant molecular representation sequences used for molecular characterization and generation include SMILES (Simplified Molecular-Input Line-Entry System), SELFIES (SELF-referencing Embedded Strings), SMARTS (Smiles Arbitrary Target Specification), and IUPAC (International Union of Pure and Applied Chemistry) nomenclature. In the context of AI-assisted drug design, each of these molecular languages has its own strengths and weaknesses, and the granularity of information encoded by different molecular representation forms varies significantly. However, the selection of an appropriate molecular representation as the input format for model training is crucial, yet this issue has not been thoroughly explored. Furthermore, the state-of-the-art models currently employed for molecular generation and optimization are diffusion models. Therefore, this study investigates the characteristics of the four mainstream molecular representation languages within the same diffusion model for training generative molecular sets. First, a single molecule is represented in four different ways through varying methodologies, followed by training a denoising diffusion model using identical parameters. Subsequently, thirty thousand molecules are generated for evaluation and analysis. The results indicate that the four molecular representation languages exhibit both similarities and differences in attribute distribution and spatial distribution; notably, SELFIES and SMARTS demonstrate a high degree of similarity, while IUPAC and SMILES show substantial differences. Additionally, IUPAC's primary advantage lies in the novelty and diversity of generated molecules, whereas SMILES excels in QEPPI and SAscore metrics, with SELFIES and SMARTS performing best on the QED metric. The findings of this research will provide crucial insights into the selection of molecular representations in AI drug design tasks, thereby contributing to enhanced efficiency in drug development.
期刊介绍:
The Journal of Computer-Aided Molecular Design provides a form for disseminating information on both the theory and the application of computer-based methods in the analysis and design of molecules. The scope of the journal encompasses papers which report new and original research and applications in the following areas:
- theoretical chemistry;
- computational chemistry;
- computer and molecular graphics;
- molecular modeling;
- protein engineering;
- drug design;
- expert systems;
- general structure-property relationships;
- molecular dynamics;
- chemical database development and usage.