Adrian Mirza, Nawaf Alampara, Sreekanth Kunchapu, Martiño Ríos-García, Benedict Emoekabu, Aswanth Krishnan, Tanya Gupta, Mara Schilling-Wilhelmi, Macjonathan Okereke, Anagha Aneesh, Mehrdad Asgari, Juliane Eberhardt, Amir Mohammad Elahi, Hani M. Elbeheiry, María Victoria Gil, Christina Glaubitz, Maximilian Greiner, Caroline T. Holick, Tim Hoffmann, Abdelrahman Ibrahim, Lea C. Klepsch, Yannik Köster, Fabian Alexander Kreth, Jakob Meyer, Santiago Miret, Jan Matthias Peschel, Michael Ringleb, Nicole C. Roesner, Johanna Schreiber, Ulrich S. Schubert, Leanne M. Stafast, A. D. Dinga Wonanke, Michael Pieler, Philippe Schwaller, Kevin Maik Jablonka
Nature Chemistry · Journal Article · Published 20 May 2025 · DOI: 10.1038/s41557-025-01815-x
A framework for evaluating the chemical knowledge and reasoning abilities of large language models against the expertise of chemists
Large language models (LLMs) have gained widespread interest owing to their ability to process human language and perform tasks on which they have not been explicitly trained. However, we possess only a limited systematic understanding of the chemical capabilities of LLMs, which would be required to improve models and mitigate potential harm. Here we introduce ChemBench, an automated framework for evaluating the chemical knowledge and reasoning abilities of state-of-the-art LLMs against the expertise of chemists. We curated more than 2,700 question–answer pairs, evaluated leading open- and closed-source LLMs and found that the best models, on average, outperformed the best human chemists in our study. However, the models struggle with some basic tasks and provide overconfident predictions. These findings reveal LLMs’ impressive chemical capabilities while emphasizing the need for further research to improve their safety and usefulness. They also suggest adapting chemistry education and show the value of benchmarking frameworks for evaluating LLMs in specific domains.
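The abstract describes an automated loop that scores model answers against curated question–answer pairs. A minimal sketch of such a loop is below; the names (`QA`, `ask_model`, `evaluate`) are illustrative placeholders, not the ChemBench API, and the stub model simply always answers "A".

```python
# Illustrative sketch of an automated Q&A benchmark loop (hypothetical names,
# not the ChemBench API): score a model's answers against curated pairs.
from dataclasses import dataclass

@dataclass
class QA:
    question: str
    answer: str  # ground-truth answer, e.g. a multiple-choice letter

def ask_model(question: str) -> str:
    # Placeholder for a real LLM call; this stub always answers "A".
    return "A"

def evaluate(pairs: list[QA]) -> float:
    """Fraction of questions the model answers exactly correctly."""
    correct = sum(ask_model(p.question).strip() == p.answer.strip()
                  for p in pairs)
    return correct / len(pairs)

pairs = [
    QA("Which element has symbol Na?  A) sodium  B) neon", "A"),
    QA("Which acid gives vinegar its taste?  A) acetic  B) nitric", "A"),
    QA("Which noble gas is lightest?  A) argon  B) helium", "B"),
]
print(evaluate(pairs))  # → 0.6666... (the stub gets 2 of 3 right)
```

A real harness would add answer parsing, per-topic breakdowns, and confidence elicitation, but the core contract (curated pairs in, accuracy out) is what makes fully automated evaluation possible.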
Journal information:
Nature Chemistry is a monthly journal that publishes groundbreaking and significant research in all areas of chemistry. It covers traditional subjects such as analytical, inorganic, organic, and physical chemistry, as well as a wide range of other topics including catalysis, computational and theoretical chemistry, and environmental chemistry.
The journal also features interdisciplinary research at the interface of chemistry with biology, materials science, nanotechnology, and physics. Manuscripts detailing such multidisciplinary work are encouraged, as long as the central theme pertains to chemistry.
Aside from primary research, Nature Chemistry publishes review articles, news and views, research highlights from other journals, commentaries, book reviews, correspondence, and analysis of the broader chemical landscape. It also addresses crucial issues related to education, funding, policy, intellectual property, and the societal impact of chemistry.
Nature Chemistry is dedicated to ensuring the highest standards of original research through a fair and rigorous review process. It offers authors maximum visibility for their papers, access to a broad readership, exceptional copy editing and production standards, rapid publication, and independence from academic societies and other vested interests.
Overall, Nature Chemistry aims to be the authoritative voice of the global chemical community.