Jiashun Mao, Tang Sui, Kwang-Hwi Cho, Kyoung Tai No, Jianmin Wang, Dongjing Shan
IUPAC-GPT: an IUPAC-based large-scale molecular pre-trained model for property prediction and molecule generation
Molecular Diversity, published 2025-07-03. DOI: 10.1007/s11030-025-11280-w (https://doi.org/10.1007/s11030-025-11280-w)
Abstract
The International Union of Pure and Applied Chemistry (IUPAC) nomenclature is a universally recognized standard system for naming chemical compounds and is a human-friendly, substructure-level molecular language. The simplified molecular input line entry system (SMILES) string is currently the most popular molecular representation and is a computer-friendly, atomic-level molecular language. Given the readability of IUPAC names and the advantages of SMILES strings, it is worthwhile to investigate how these two molecular languages differ in molecular generation and in regression/classification tasks. We therefore developed a chemical language model named IUPAC-GPT. Beyond molecular generation, we also freeze the IUPAC-GPT model parameters and attach trainable lightweight networks to fine-tune for regression/classification tasks. The results indicate that pre-trained IUPAC-GPT captures general knowledge that transfers effectively to downstream tasks such as molecular generation, binary classification, and property regression. Furthermore, under the same configuration, IUPAC-GPT outperformed the smilesGPT model on some property prediction tasks. Overall, transformer-like language models pretrained on IUPAC corpora emerge as promising alternatives, offering better interpretability and semantic abstraction (chemical groups and modifications) than models pretrained on SMILES corpora.
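The fine-tuning strategy mentioned in the abstract (frozen pre-trained parameters plus a small trainable network) can be illustrated with a toy sketch. Everything here is an assumption made for illustration and not the authors' implementation: the `FrozenBackbone` class uses one-hot embeddings as a stand-in for a pre-trained transformer, `LinearHead` is a hypothetical logistic-regression head, and the four-molecule dataset and its fragment vocabulary are invented.

```python
import copy
import math

# Hypothetical toy vocabulary of IUPAC-style name fragments (not the paper's tokenizer).
VOCAB = ["eth", "prop", "an", "ol", "e"]


class FrozenBackbone:
    """Stand-in for a pre-trained language model whose parameters stay fixed."""

    def __init__(self, vocab):
        # One-hot embeddings play the role of pre-trained weights in this toy.
        self.embed = {
            tok: [1.0 if i == j else 0.0 for j in range(len(vocab))]
            for i, tok in enumerate(vocab)
        }

    def features(self, tokens):
        # Mean-pool the token embeddings into one fixed-size vector.
        vecs = [self.embed[t] for t in tokens]
        return [sum(col) / len(vecs) for col in zip(*vecs)]


class LinearHead:
    """Lightweight trainable head: logistic regression on backbone features."""

    def __init__(self, dim):
        self.w = [0.0] * dim
        self.b = 0.0

    def predict(self, x):
        z = sum(wi * xi for wi, xi in zip(self.w, x)) + self.b
        return 1.0 / (1.0 + math.exp(-z))

    def sgd_step(self, x, y, lr):
        # Gradient of the binary cross-entropy loss with respect to z is (p - y).
        err = self.predict(x) - y
        self.w = [wi - lr * err * xi for wi, xi in zip(self.w, x)]
        self.b -= lr * err


def train_head(backbone, head, data, epochs=200, lr=0.5):
    # Only the head is updated; the backbone is used purely as a feature extractor.
    for _ in range(epochs):
        for tokens, label in data:
            head.sgd_step(backbone.features(tokens), label, lr)


# Toy binary task: label 1 if the name ends in the alcohol suffix "ol".
data = [
    (["eth", "an", "ol"], 1),   # ethanol
    (["prop", "an", "ol"], 1),  # propanol
    (["eth", "an", "e"], 0),    # ethane
    (["prop", "an", "e"], 0),   # propane
]

backbone = FrozenBackbone(VOCAB)
snapshot = copy.deepcopy(backbone.embed)  # kept to check the backbone is untouched
head = LinearHead(dim=len(VOCAB))
train_head(backbone, head, data)

for tokens, label in data:
    p = head.predict(backbone.features(tokens))
    print("".join(tokens), label, round(p, 3))
```

In a realistic setting the backbone would be a pre-trained transformer with gradient updates disabled, and the head would consume its hidden states; the toy only shows the division of labor between fixed features and a small trainable component.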
About the journal:
Molecular Diversity is a new publication forum for the rapid publication of refereed papers dedicated to describing the development, application and theory of molecular diversity and combinatorial chemistry in basic and applied research and drug discovery. The journal publishes both short and full papers, perspectives, news and reviews dealing with all aspects of the generation of molecular diversity, application of diversity for screening against alternative targets of all types (biological, biophysical, technological), analysis of results obtained and their application in various scientific disciplines/approaches including:
combinatorial chemistry and parallel synthesis;
small molecule libraries;
microwave synthesis;
flow synthesis;
fluorous synthesis;
diversity oriented synthesis (DOS);
nanoreactors;
click chemistry;
multiplex technologies;
fragment- and ligand-based design;
structure/function/SAR;
computational chemistry and molecular design;
chemoinformatics;
screening techniques and screening interfaces;
analytical and purification methods;
robotics, automation and miniaturization;
targeted libraries;
display libraries;
peptides and peptoids;
proteins;
oligonucleotides;
carbohydrates;
natural diversity;
new methods of library formulation and deconvolution;
directed evolution, origin of life and recombination;
search techniques, landscapes, random chemistry and more.