Fusing Domain Knowledge with a Fine-Tuned Large Language Model for Enhanced Molecular Property Prediction

IF 5.7 · Region 1 (Chemistry) · JCR Q2 (Chemistry, Physical)
Liangxu Xie, Yingdi Jin, Lei Xu, Shan Chang, Xiaojun Xu
{"title":"融合领域知识与微调大语言模型增强分子性质预测。","authors":"Liangxu Xie,Yingdi Jin,Lei Xu,Shan Chang,Xiaojun Xu","doi":"10.1021/acs.jctc.5c00605","DOIUrl":null,"url":null,"abstract":"Although large language models (LLMs) have flourished in various scientific applications, their applications in the specific task of molecular property prediction have not reached a satisfactory level, even for the specific chemistry LLMs. This work addresses a highly crucial and significant challenge existing in the field of drug discovery: accurately predicting the molecular properties by effectively leveraging LLMs enhanced with profound domain knowledge. We propose a Knowledge-Fused Large Language Model for dual-Modality (KFLM2) learning for molecular property prediction. The aim is to utilize the capabilities of advanced LLMs, strengthened with specialized knowledge in the field of drug discovery. We identified DeepSeek-R1-Distill-Qwen-1.5B as the optimal base model from three DeepSeek-R1 distilled LLMs and one chemistry LLM named ChemDFM, by fine-tuning with the ZINC and ChEMBL datasets. We obtained the SMILES embeddings from the fine-tuned model and subsequently integrated the embeddings with the molecular graph to leverage complementary information for predicting molecular properties. Finally, we trained the hybrid neural network on the combined dual modality inputs and predicted the molecular properties. Through benchmarking on regression and classification tasks, our proposed method can obtain higher prediction performance for nine out of ten datasets in the downstream regression and classification tasks. Visualization of the output of hidden layers indicates that the combination of the embedding with the molecular graph can offer complementary information to further improve the prediction accuracy compared with either the LLM embedding or the molecular graph inputs. Larger models do not inherently guarantee superior performance; instead, their effectiveness hinges on our ability to leverage relevant knowledge from both pretraining and fine-tuning. Implementing LLMs with domain knowledge would be a rational approach to making precise predictions that could potentially revolutionize the process of drug development and discovery.","PeriodicalId":45,"journal":{"name":"Journal of Chemical Theory and Computation","volume":"4 1","pages":""},"PeriodicalIF":5.7000,"publicationDate":"2025-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Fusing Domain Knowledge with a Fine-Tuned Large Language Model for Enhanced Molecular Property Prediction.\",\"authors\":\"Liangxu Xie,Yingdi Jin,Lei Xu,Shan Chang,Xiaojun Xu\",\"doi\":\"10.1021/acs.jctc.5c00605\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Although large language models (LLMs) have flourished in various scientific applications, their applications in the specific task of molecular property prediction have not reached a satisfactory level, even for the specific chemistry LLMs. This work addresses a highly crucial and significant challenge existing in the field of drug discovery: accurately predicting the molecular properties by effectively leveraging LLMs enhanced with profound domain knowledge. We propose a Knowledge-Fused Large Language Model for dual-Modality (KFLM2) learning for molecular property prediction. The aim is to utilize the capabilities of advanced LLMs, strengthened with specialized knowledge in the field of drug discovery. 
We identified DeepSeek-R1-Distill-Qwen-1.5B as the optimal base model from three DeepSeek-R1 distilled LLMs and one chemistry LLM named ChemDFM, by fine-tuning with the ZINC and ChEMBL datasets. We obtained the SMILES embeddings from the fine-tuned model and subsequently integrated the embeddings with the molecular graph to leverage complementary information for predicting molecular properties. Finally, we trained the hybrid neural network on the combined dual modality inputs and predicted the molecular properties. Through benchmarking on regression and classification tasks, our proposed method can obtain higher prediction performance for nine out of ten datasets in the downstream regression and classification tasks. Visualization of the output of hidden layers indicates that the combination of the embedding with the molecular graph can offer complementary information to further improve the prediction accuracy compared with either the LLM embedding or the molecular graph inputs. Larger models do not inherently guarantee superior performance; instead, their effectiveness hinges on our ability to leverage relevant knowledge from both pretraining and fine-tuning. Implementing LLMs with domain knowledge would be a rational approach to making precise predictions that could potentially revolutionize the process of drug development and discovery.\",\"PeriodicalId\":45,\"journal\":{\"name\":\"Journal of Chemical Theory and Computation\",\"volume\":\"4 1\",\"pages\":\"\"},\"PeriodicalIF\":5.7000,\"publicationDate\":\"2025-07-09\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of Chemical Theory and Computation\",\"FirstCategoryId\":\"92\",\"ListUrlMain\":\"https://doi.org/10.1021/acs.jctc.5c00605\",\"RegionNum\":1,\"RegionCategory\":\"化学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"CHEMISTRY, PHYSICAL\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Chemical Theory and Computation","FirstCategoryId":"92","ListUrlMain":"https://doi.org/10.1021/acs.jctc.5c00605","RegionNum":1,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"CHEMISTRY, PHYSICAL","Score":null,"Total":0}
Citations: 0

Abstract

Although large language models (LLMs) have flourished in various scientific applications, their performance on the specific task of molecular property prediction has not reached a satisfactory level, even for chemistry-specific LLMs. This work addresses a crucial challenge in drug discovery: accurately predicting molecular properties by effectively leveraging LLMs enhanced with deep domain knowledge. We propose a Knowledge-Fused Large Language Model for dual-Modality (KFLM2) learning for molecular property prediction. The aim is to exploit the capabilities of advanced LLMs strengthened with specialized knowledge from the drug discovery domain. By fine-tuning on the ZINC and ChEMBL datasets, we identified DeepSeek-R1-Distill-Qwen-1.5B as the optimal base model among three DeepSeek-R1-distilled LLMs and one chemistry LLM, ChemDFM. We obtained SMILES embeddings from the fine-tuned model and then integrated them with the molecular graph to leverage complementary information for property prediction. Finally, we trained a hybrid neural network on the combined dual-modality inputs to predict molecular properties. In benchmarks on downstream regression and classification tasks, the proposed method achieves higher prediction performance on nine out of ten datasets. Visualization of the hidden-layer outputs indicates that combining the LLM embedding with the molecular graph provides complementary information that further improves prediction accuracy over either the LLM embedding or the molecular graph alone. Larger models do not inherently guarantee superior performance; their effectiveness hinges on how well relevant knowledge from both pretraining and fine-tuning is leveraged. Equipping LLMs with domain knowledge is therefore a rational route to precise predictions that could transform drug development and discovery.
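The abstract describes the pipeline only at a high level. As a minimal sketch of the embedding step, the snippet below pulls a SMILES embedding from the base model named in the paper via Hugging Face transformers. It uses the public base checkpoint (not the authors' fine-tuned weights), and mean-pooling over the last hidden states is an assumption for illustration; the paper's exact pooling strategy is not given here.

```python
# Minimal sketch: embed a SMILES string with the base LLM named in the abstract.
# Assumptions: public base checkpoint and mean-pooling over last hidden states.
import torch
from transformers import AutoTokenizer, AutoModel

MODEL_ID = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID).eval()

def smiles_embedding(smiles: str) -> torch.Tensor:
    """Return a fixed-size embedding for one SMILES string."""
    inputs = tokenizer(smiles, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, hidden_dim)
    return hidden.mean(dim=1)                       # (1, hidden_dim), 1536 for this model

print(smiles_embedding("CCO").shape)  # ethanol -> torch.Size([1, 1536])
```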
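For the dual-modality step, one plausible reading of "integrating the embeddings with the molecular graph" is late fusion: concatenate the LLM embedding with a graph-level readout (e.g., from a GNN) and feed the result to a small prediction head. The dimensions and architecture below are illustrative assumptions, not the authors' published network.

```python
# Hypothetical dual-modality head: concatenate the LLM SMILES embedding
# with a graph-level feature vector and predict one property value.
import torch
import torch.nn as nn

class DualModalityHead(nn.Module):
    def __init__(self, llm_dim=1536, graph_dim=256, hidden=512, out_dim=1):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(llm_dim + graph_dim, hidden),  # fuse both modalities
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden, out_dim),              # regression value or class logit
        )

    def forward(self, llm_emb: torch.Tensor, graph_emb: torch.Tensor) -> torch.Tensor:
        return self.mlp(torch.cat([llm_emb, graph_emb], dim=-1))

# Usage with stand-in tensors; a real pipeline would supply a GNN readout vector.
head = DualModalityHead()
pred = head(torch.randn(1, 1536), torch.randn(1, 256))
print(pred.shape)  # torch.Size([1, 1])
```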
Source Journal
Journal of Chemical Theory and Computation (Chemistry – Physics: Atomic, Molecular, and Chemical Physics)
CiteScore: 9.90
Self-citation rate: 16.40%
Annual article count: 568
Review time: 1 month
About the journal: The Journal of Chemical Theory and Computation invites new and original contributions with the understanding that, if accepted, they will not be published elsewhere. Papers reporting new theories, methodology, and/or important applications in quantum electronic structure, molecular dynamics, and statistical mechanics are appropriate for submission to this Journal. Specific topics include advances in or applications of ab initio quantum mechanics, density functional theory, design and properties of new materials, surface science, Monte Carlo simulations, solvation models, QM/MM calculations, biomolecular structure prediction, and molecular dynamics in the broadest sense including gas-phase dynamics, ab initio dynamics, biomolecular dynamics, and protein folding. The Journal does not consider papers that are straightforward applications of known methods including DFT and molecular dynamics. The Journal favors submissions that include advances in theory or methodology with applications to compelling problems.