{"title":"ECNU-ChemGPT: A Large Language Model for Chemistry and Retrosynthesis Predictions","authors":"Yueqing Zhang, Wentao Liu, Yan Zhang, Danyang Xiong, Jihang Zhai, Hao Hao, Jiaxi Zhuang, Hui Wang, Yucheng Gu, Haibo Yang, Shuanhu Gao, Lianrui Hu, Aimin Zhou, Xiao He","doi":"10.31635/ccschem.026.202607359","DOIUrl":null,"url":null,"abstract":"Large language models (LLMs) have achieved impressive progress across a broad range of general-purpose tasks, but their effectiveness in chemistry remains limited due to scarce domain-specific datasets, and the demand for precise symbolic and structural reasoning. Here we introduce ECNU-ChemGPT (East China Normal University Chemistry GPT model), a chemistry-specialized LLM engineered for deep chemical knowledge understanding and accurate retrosynthetic route planning. Our approach is distinguished by four key strategies: structured prompt-based knowledge distillation from authoritative chemistry textbooks to construct a high-quality question-answering dataset; domain-specific prompt engineering using curated chemical keywords, combined with LLM APIs for data derivation and knowledge distillation; fine-tuning on a meticulously cleaned Pistachio reaction dataset to enhance retrosynthesis prediction accuracy; and integration of BrainGPT (a brain-inspired multi-model scheduling framework), a dynamic multi-model scheduling framework that enables task-specific invocation of multiple specialized models trained for diverse chemistryrelated tasks. ECNU-ChemGPT exhibits superior performance on chemistry questionanswering and retrosynthetic planning benchmarks, outperforming leading generalpurpose models-including DeepSeek-R1, Qwen-2.5, and GPT-4o. In retrosynthesis, it achieves a Top-1 accuracy of 68.3% on the USPTO-50K dataset and successfully reproduces seven complete drug synthesis routes reported in the literatures and patents. 
These results underscore the effectiveness of domain-adapted LLM fine-tuning combined with dynamic multi-model task scheduling, providing a scalable and robust solution for chemical knowledge question answering, and retrosynthetic planning.","PeriodicalId":9810,"journal":{"name":"CCS Chemistry","volume":"70 1","pages":""},"PeriodicalIF":9.2000,"publicationDate":"2026-04-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"CCS Chemistry","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.31635/ccschem.026.202607359","RegionNum":1,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"CHEMISTRY, MULTIDISCIPLINARY","Score":null,"Total":0}
Citation count: 0
Abstract
Large language models (LLMs) have achieved impressive progress across a broad range of general-purpose tasks, but their effectiveness in chemistry remains limited due to scarce domain-specific datasets and the demand for precise symbolic and structural reasoning. Here we introduce ECNU-ChemGPT (East China Normal University Chemistry GPT model), a chemistry-specialized LLM engineered for deep chemical knowledge understanding and accurate retrosynthetic route planning. Our approach is distinguished by four key strategies: structured prompt-based knowledge distillation from authoritative chemistry textbooks to construct a high-quality question-answering dataset; domain-specific prompt engineering using curated chemical keywords, combined with LLM APIs for data derivation and knowledge distillation; fine-tuning on a meticulously cleaned Pistachio reaction dataset to enhance retrosynthesis prediction accuracy; and integration of BrainGPT, a brain-inspired dynamic multi-model scheduling framework that enables task-specific invocation of multiple specialized models trained for diverse chemistry-related tasks. ECNU-ChemGPT exhibits superior performance on chemistry question-answering and retrosynthetic planning benchmarks, outperforming leading general-purpose models, including DeepSeek-R1, Qwen-2.5, and GPT-4o. In retrosynthesis, it achieves a Top-1 accuracy of 68.3% on the USPTO-50K dataset and successfully reproduces seven complete drug synthesis routes reported in the literature and patents. These results underscore the effectiveness of domain-adapted LLM fine-tuning combined with dynamic multi-model task scheduling, providing a scalable and robust solution for chemical knowledge question answering and retrosynthetic planning.
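The BrainGPT component described above dispatches each query to a task-specific specialist model. The abstract does not disclose the routing implementation, so the following is only an illustrative sketch: the model names, the keyword heuristic, and the stand-in callables are all invented here, not taken from the paper.

```python
# Hypothetical sketch of task-specific multi-model routing in the spirit of
# BrainGPT's dynamic scheduling. Model names and keywords are illustrative only.
from typing import Callable

# Stand-in "models": callables mapping a query string to a response string.
SPECIALISTS: dict[str, Callable[[str], str]] = {
    "retrosynthesis": lambda q: f"[retro-model] planning route for: {q}",
    "qa": lambda q: f"[qa-model] answering: {q}",
}

# Toy trigger phrases for the retrosynthesis specialist.
RETRO_KEYWORDS = ("retrosynthesis", "synthesize", "synthesis route", "precursor")

def route(query: str) -> str:
    """Send the query to the retrosynthesis specialist if it matches any
    trigger keyword; otherwise fall back to the general QA specialist."""
    task = "retrosynthesis" if any(k in query.lower() for k in RETRO_KEYWORDS) else "qa"
    return SPECIALISTS[task](query)

print(route("Propose a synthesis route for aspirin"))   # routed to retro-model
print(route("What is the pKa of acetic acid?"))         # routed to qa-model
```

A production scheduler would presumably use a learned classifier rather than keyword matching, but the dispatch structure (a registry of specialists plus a routing decision per query) is the same.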
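The 68.3% Top-1 figure reported above uses the standard Top-k exact-match metric for USPTO-50K retrosynthesis: a prediction counts as correct if the reference reactant set appears among the model's first k candidates. As a minimal sketch (assuming predictions and references are already canonicalized SMILES strings; published evaluations typically canonicalize with a cheminformatics toolkit such as RDKit first):

```python
# Top-k exact-match accuracy over ranked retrosynthesis predictions.
# Assumes all SMILES strings are pre-canonicalized to the same form.
def top_k_accuracy(predictions: list[list[str]], references: list[str], k: int = 1) -> float:
    """Fraction of examples whose reference appears in the first k predictions."""
    if not references:
        return 0.0
    hits = sum(ref in preds[:k] for preds, ref in zip(predictions, references))
    return hits / len(references)

# Toy example with made-up reactant SMILES:
preds = [
    ["CCO.CC(=O)O", "CCBr"],      # reference is the rank-1 candidate
    ["c1ccccc1Br", "c1ccccc1I"],  # reference is only the rank-2 candidate
]
refs = ["CCO.CC(=O)O", "c1ccccc1I"]
print(top_k_accuracy(preds, refs, k=1))  # 0.5
print(top_k_accuracy(preds, refs, k=2))  # 1.0
```

Exact string match is why canonicalization matters: the same molecule can have many valid SMILES spellings, and comparing non-canonical strings would undercount hits.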
Journal introduction:
CCS Chemistry, the flagship publication of the Chinese Chemical Society, stands as a leading international chemistry journal based in China. With a commitment to global outreach in both contributions and readership, the journal operates on a fully Open Access model, eliminating subscription fees for contributing authors. Issued monthly, all articles are published online promptly upon reaching final publishable form. Additionally, authors have the option to expedite the posting process through Immediate Online Accepted Article posting, making a PDF of their accepted article available online upon journal acceptance.