A framework for evaluating the chemical knowledge and reasoning abilities of large language models against the expertise of chemists

Impact factor: 19.2; Zone 1 (Chemistry); JCR Q1, Chemistry, Multidisciplinary
Adrian Mirza, Nawaf Alampara, Sreekanth Kunchapu, Martiño Ríos-García, Benedict Emoekabu, Aswanth Krishnan, Tanya Gupta, Mara Schilling-Wilhelmi, Macjonathan Okereke, Anagha Aneesh, Mehrdad Asgari, Juliane Eberhardt, Amir Mohammad Elahi, Hani M. Elbeheiry, María Victoria Gil, Christina Glaubitz, Maximilian Greiner, Caroline T. Holick, Tim Hoffmann, Abdelrahman Ibrahim, Lea C. Klepsch, Yannik Köster, Fabian Alexander Kreth, Jakob Meyer, Santiago Miret, Jan Matthias Peschel, Michael Ringleb, Nicole C. Roesner, Johanna Schreiber, Ulrich S. Schubert, Leanne M. Stafast, A. D. Dinga Wonanke, Michael Pieler, Philippe Schwaller, Kevin Maik Jablonka
Nature Chemistry, published 2025-05-20. DOI: 10.1038/s41557-025-01815-x (https://doi.org/10.1038/s41557-025-01815-x)

Abstract

Large language models (LLMs) have gained widespread interest owing to their ability to process human language and perform tasks on which they have not been explicitly trained. However, we possess only a limited systematic understanding of the chemical capabilities of LLMs, which would be required to improve models and mitigate potential harm. Here we introduce ChemBench, an automated framework for evaluating the chemical knowledge and reasoning abilities of state-of-the-art LLMs against the expertise of chemists. We curated more than 2,700 question–answer pairs, evaluated leading open- and closed-source LLMs and found that the best models, on average, outperformed the best human chemists in our study. However, the models struggle with some basic tasks and provide overconfident predictions. These findings reveal LLMs’ impressive chemical capabilities while emphasizing the need for further research to improve their safety and usefulness. They also suggest adapting chemistry education and show the value of benchmarking frameworks for evaluating LLMs in specific domains.

Source journal: Nature Chemistry (category: Chemistry, Multidisciplinary)
CiteScore: 29.60
Self-citation rate: 1.40%
Articles published: 226
Review time: 1.7 months
About the journal: Nature Chemistry is a monthly journal that publishes groundbreaking and significant research in all areas of chemistry. It covers traditional subjects such as analytical, inorganic, organic, and physical chemistry, as well as a wide range of other topics including catalysis, computational and theoretical chemistry, and environmental chemistry. The journal also features interdisciplinary research at the interface of chemistry with biology, materials science, nanotechnology, and physics. Manuscripts detailing such multidisciplinary work are encouraged, as long as the central theme pertains to chemistry. Aside from primary research, Nature Chemistry publishes review articles, news and views, research highlights from other journals, commentaries, book reviews, correspondence, and analysis of the broader chemical landscape. It also addresses crucial issues related to education, funding, policy, intellectual property, and the societal impact of chemistry. Nature Chemistry is dedicated to ensuring the highest standards of original research through a fair and rigorous review process. It offers authors maximum visibility for their papers, access to a broad readership, exceptional copy editing and production standards, rapid publication, and independence from academic societies and other vested interests. Overall, Nature Chemistry aims to be the authoritative voice of the global chemical community.