{"title":"ECNU-ChemGPT: A Large Language Model for Chemistry and Retrosynthesis Predictions","authors":"Yueqing Zhang, Wentao Liu, Yan Zhang, Danyang Xiong, Jihang Zhai, Hao Hao, Jiaxi Zhuang, Hui Wang, Yucheng Gu, Haibo Yang, Shuanhu Gao, Lianrui Hu, Aimin Zhou, Xiao He","doi":"10.31635/ccschem.026.202607359","DOIUrl":null,"url":null,"abstract":"Large language models (LLMs) have achieved impressive progress across a broad range of general-purpose tasks, but their effectiveness in chemistry remains limited due to scarce domain-specific datasets, and the demand for precise symbolic and structural reasoning. Here we introduce ECNU-ChemGPT (East China Normal University Chemistry GPT model), a chemistry-specialized LLM engineered for deep chemical knowledge understanding and accurate retrosynthetic route planning. Our approach is distinguished by four key strategies: structured prompt-based knowledge distillation from authoritative chemistry textbooks to construct a high-quality question-answering dataset; domain-specific prompt engineering using curated chemical keywords, combined with LLM APIs for data derivation and knowledge distillation; fine-tuning on a meticulously cleaned Pistachio reaction dataset to enhance retrosynthesis prediction accuracy; and integration of BrainGPT (a brain-inspired multi-model scheduling framework), a dynamic multi-model scheduling framework that enables task-specific invocation of multiple specialized models trained for diverse chemistryrelated tasks. ECNU-ChemGPT exhibits superior performance on chemistry questionanswering and retrosynthetic planning benchmarks, outperforming leading generalpurpose models-including DeepSeek-R1, Qwen-2.5, and GPT-4o. In retrosynthesis, it achieves a Top-1 accuracy of 68.3% on the USPTO-50K dataset and successfully reproduces seven complete drug synthesis routes reported in the literatures and patents. 
These results underscore the effectiveness of domain-adapted LLM fine-tuning combined with dynamic multi-model task scheduling, providing a scalable and robust solution for chemical knowledge question answering, and retrosynthetic planning.","PeriodicalId":9810,"journal":{"name":"CCS Chemistry","volume":"70 1","pages":""},"PeriodicalIF":9.2000,"publicationDate":"2026-04-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"CCS Chemistry","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.31635/ccschem.026.202607359","RegionNum":1,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"CHEMISTRY, MULTIDISCIPLINARY","Score":null,"Total":0}
Citation count: 0
Abstract
Large language models (LLMs) have achieved impressive progress across a broad range of general-purpose tasks, but their effectiveness in chemistry remains limited due to scarce domain-specific datasets and the demand for precise symbolic and structural reasoning. Here we introduce ECNU-ChemGPT (East China Normal University Chemistry GPT model), a chemistry-specialized LLM engineered for deep chemical knowledge understanding and accurate retrosynthetic route planning. Our approach is distinguished by four key strategies: structured prompt-based knowledge distillation from authoritative chemistry textbooks to construct a high-quality question-answering dataset; domain-specific prompt engineering using curated chemical keywords, combined with LLM APIs for data derivation and knowledge distillation; fine-tuning on a meticulously cleaned Pistachio reaction dataset to enhance retrosynthesis prediction accuracy; and integration of BrainGPT, a brain-inspired dynamic multi-model scheduling framework that enables task-specific invocation of multiple specialized models trained for diverse chemistry-related tasks. ECNU-ChemGPT exhibits superior performance on chemistry question-answering and retrosynthetic planning benchmarks, outperforming leading general-purpose models, including DeepSeek-R1, Qwen-2.5, and GPT-4o. In retrosynthesis, it achieves a Top-1 accuracy of 68.3% on the USPTO-50K dataset and successfully reproduces seven complete drug synthesis routes reported in the literature and patents. These results underscore the effectiveness of domain-adapted LLM fine-tuning combined with dynamic multi-model task scheduling, providing a scalable and robust solution for chemical knowledge question answering and retrosynthetic planning.
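The BrainGPT component described above dispatches each query to a task-specific specialist model. The abstract does not disclose the routing implementation, so the following is only an illustrative sketch: the model names, the keyword heuristic, and the stand-in callables are all invented here, not taken from the paper.

```python
# Hypothetical sketch of task-specific multi-model routing in the spirit of
# BrainGPT's dynamic scheduling. Model names and keywords are illustrative only.
from typing import Callable

# Stand-in "models": callables mapping a query string to a response string.
SPECIALISTS: dict[str, Callable[[str], str]] = {
    "retrosynthesis": lambda q: f"[retro-model] planning route for: {q}",
    "qa": lambda q: f"[qa-model] answering: {q}",
}

# Toy trigger phrases for the retrosynthesis specialist.
RETRO_KEYWORDS = ("retrosynthesis", "synthesize", "synthesis route", "precursor")

def route(query: str) -> str:
    """Send the query to the retrosynthesis specialist if it matches any
    trigger keyword; otherwise fall back to the general QA specialist."""
    task = "retrosynthesis" if any(k in query.lower() for k in RETRO_KEYWORDS) else "qa"
    return SPECIALISTS[task](query)

print(route("Propose a synthesis route for aspirin"))   # routed to retro-model
print(route("What is the pKa of acetic acid?"))         # routed to qa-model
```

A production scheduler would presumably use a learned classifier rather than keyword matching, but the dispatch structure (a registry of specialists plus a routing decision per query) is the same.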
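The 68.3% Top-1 figure reported above uses the standard Top-k exact-match metric for USPTO-50K retrosynthesis: a prediction counts as correct if the reference reactant set appears among the model's first k candidates. As a minimal sketch (assuming predictions and references are already canonicalized SMILES strings; published evaluations typically canonicalize with a cheminformatics toolkit such as RDKit first):

```python
# Top-k exact-match accuracy over ranked retrosynthesis predictions.
# Assumes all SMILES strings are pre-canonicalized to the same form.
def top_k_accuracy(predictions: list[list[str]], references: list[str], k: int = 1) -> float:
    """Fraction of examples whose reference appears in the first k predictions."""
    if not references:
        return 0.0
    hits = sum(ref in preds[:k] for preds, ref in zip(predictions, references))
    return hits / len(references)

# Toy example with made-up reactant SMILES:
preds = [
    ["CCO.CC(=O)O", "CCBr"],      # reference is the rank-1 candidate
    ["c1ccccc1Br", "c1ccccc1I"],  # reference is only the rank-2 candidate
]
refs = ["CCO.CC(=O)O", "c1ccccc1I"]
print(top_k_accuracy(preds, refs, k=1))  # 0.5
print(top_k_accuracy(preds, refs, k=2))  # 1.0
```

Exact string match is why canonicalization matters: the same molecule can have many valid SMILES spellings, and comparing non-canonical strings would undercount hits.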
Journal introduction:
CCS Chemistry, the flagship publication of the Chinese Chemical Society, stands as a leading international chemistry journal based in China. With a commitment to global outreach in both contributions and readership, the journal operates on a fully Open Access model, eliminating subscription fees for contributing authors. Issued monthly, all articles are published online promptly upon reaching final publishable form. Additionally, authors have the option to expedite the posting process through Immediate Online Accepted Article posting, making a PDF of their accepted article available online upon journal acceptance.