Jiashun Mao, Tang Sui, Kwang-Hwi Cho, Kyoung Tai No, Jianmin Wang, Dongjing Shan
Journal: Molecular Diversity (impact factor 3.8; JCR Q2, Chemistry, Applied)
Published: 2025-07-03
DOI: 10.1007/s11030-025-11280-w (https://doi.org/10.1007/s11030-025-11280-w)
Publication type: Journal Article
Citations: 0
IUPAC-GPT: an IUPAC-based large-scale molecular pre-trained model for property prediction and molecule generation.
International Union of Pure and Applied Chemistry (IUPAC) nomenclature is a universally recognized standard system for naming chemical compounds and is a human-friendly, substructure-level molecular language. The simplified molecular input line entry system (SMILES) string is currently the most popular molecular representation and is a computer-friendly, atomic-level molecular language. Given the readability of IUPAC names and the advantages of SMILES strings, it is worthwhile to investigate how these two molecular languages differ in molecular generation and regression/classification tasks. Thus, we have developed a chemical language model named IUPAC-GPT. Besides molecular generation, we also freeze the IUPAC-GPT model parameters and attach trainable lightweight networks for fine-tuning on regression/classification tasks. The results indicate that pre-trained IUPAC-GPT captures general knowledge that transfers effectively to downstream tasks such as molecular generation, binary classification, and property regression. Furthermore, under the same configuration, IUPAC-GPT outperformed the smilesGPT model on some property prediction tasks. Overall, transformer-like language models pretrained on IUPAC corpora emerge as promising alternatives, offering improved interpretability and semantic abstraction (chemical groups and modifications) compared with models pretrained on SMILES corpora.
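The fine-tuning strategy mentioned in the abstract (freeze the pretrained parameters, train only a lightweight attached network) can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the random character embeddings merely stand in for a pretrained backbone, the logistic-regression head stands in for the "lightweight network", and the alcohol/alkane labels are a made-up task. None of it reproduces the IUPAC-GPT architecture, tokenizer, or data.

```python
import copy
import math
import random

random.seed(0)  # reproducible toy weights

DIM = 8
VOCAB = "abcdefghijklmnopqrstuvwxyz0123456789-,()"

# Frozen "backbone": fixed random character embeddings standing in for
# pretrained weights (hypothetical; NOT the IUPAC-GPT architecture).
FROZEN_EMBED = {ch: [random.uniform(-1.0, 1.0) for _ in range(DIM)] for ch in VOCAB}


def encode(name):
    """Mean-pool the frozen character embeddings into one feature vector."""
    vecs = [FROZEN_EMBED[ch] for ch in name.lower() if ch in FROZEN_EMBED]
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def mean_loss(w, b, feats, labels):
    """Average binary cross-entropy of the head on the given features."""
    total = 0.0
    for x, y in zip(feats, labels):
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(feats)


def train_head(feats, labels, epochs=200, lr=0.4):
    """Full-batch gradient descent; only the head (w, b) is updated."""
    w, b = [0.0] * DIM, 0.0
    n = len(feats)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * DIM, 0.0
        for x, y in zip(feats, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            grad_w = [g + err * xi / n for g, xi in zip(grad_w, x)]
            grad_b += err / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


# Toy, made-up task: does the name denote an alcohol (1) or an alkane (0)?
names = ["ethanol", "propan-1-ol", "butan-2-ol", "ethane", "propane", "butane"]
labels = [1, 1, 1, 0, 0, 0]

snapshot = copy.deepcopy(FROZEN_EMBED)      # to verify the backbone stays frozen
feats = [encode(name) for name in names]    # features come from frozen weights only
initial_loss = mean_loss([0.0] * DIM, 0.0, feats, labels)
w, b = train_head(feats, labels)
final_loss = mean_loss(w, b, feats, labels)

print(f"loss before: {initial_loss:.3f}, after: {final_loss:.3f}")
print("backbone unchanged:", snapshot == FROZEN_EMBED)
```

The point of the design is that the encoder is only ever read, never written, so the cost of adapting to a new task is limited to the few parameters of the head. In a real setting the frozen part would be the pretrained transformer and the head a small neural network.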
About the journal:
Molecular Diversity is a new publication forum for the rapid publication of refereed papers dedicated to describing the development, application and theory of molecular diversity and combinatorial chemistry in basic and applied research and drug discovery. The journal publishes both short and full papers, perspectives, news and reviews dealing with all aspects of the generation of molecular diversity, application of diversity for screening against alternative targets of all types (biological, biophysical, technological), analysis of results obtained and their application in various scientific disciplines/approaches including:
combinatorial chemistry and parallel synthesis;
small molecule libraries;
microwave synthesis;
flow synthesis;
fluorous synthesis;
diversity oriented synthesis (DOS);
nanoreactors;
click chemistry;
multiplex technologies;
fragment- and ligand-based design;
structure/function/SAR;
computational chemistry and molecular design;
chemoinformatics;
screening techniques and screening interfaces;
analytical and purification methods;
robotics, automation and miniaturization;
targeted libraries;
display libraries;
peptides and peptoids;
proteins;
oligonucleotides;
carbohydrates;
natural diversity;
new methods of library formulation and deconvolution;
directed evolution, origin of life and recombination;
search techniques, landscapes, random chemistry and more.