Developing large language models for quantum chemistry simulation input generation†

Impact factor: 6.2, JCR Q1 (Chemistry, Multidisciplinary)

Digital discovery, pp. 762-775. Pub Date: 2025-02-05. DOI: 10.1039/D4DD00366G

Pieter Floris Jacobs and Robert Pollice

Citations: 0

Abstract

Scientists across domains are often challenged to master domain-specific languages (DSLs) for their research, which are merely a means to an end but are pervasive in fields like computational chemistry. Automated code generation promises to overcome this barrier, allowing researchers to focus on their core expertise. While large language models (LLMs) have shown impressive capabilities in synthesizing code from natural language prompts, they often struggle with DSLs, likely due to their limited exposure during training. In this work, we investigate the potential of foundational LLMs for generating input files for the quantum chemistry package ORCA by establishing a general framework that can be adapted to other DSLs. To improve upon our base model, we explore the impact of prompt engineering, retrieval-augmented generation, and finetuning via synthetically generated datasets. We find that finetuning, even with synthetic datasets as small as 500 samples, significantly improves performance. Additionally, we observe that finetuning shows synergism with advanced prompt engineering such as chain-of-thought prompting. Consequently, our best finetuned models outperform the formally much more powerful GPT-4o model. In turn, finetuning GPT-4o with the same small synthetic dataset leads to a further substantial performance improvement, suggesting our approach to be more general rather than limited to LLMs with poor base proficiency. All tools and datasets are made openly available for future research. We believe that this research lays the groundwork for a wider adoption of LLMs for DSLs in chemistry and beyond.
Source journal

Digital discovery

CiteScore

2.80

Self-citation rate

0.00%

Number of articles published