Evaluating the performance and robustness of LLMs in materials science Q&A and property predictions†

IF 6.2 Q1 CHEMISTRY, MULTIDISCIPLINARY
Hongchen Wang, Kangming Li, Scott Ramsay, Yao Fehlis, Edward Kim and Jason Hattrick-Simpers
{"title":"评估llm在材料科学问答和性能预测中的性能和稳健性","authors":"Hongchen Wang, Kangming Li, Scott Ramsay, Yao Fehlis, Edward Kim and Jason Hattrick-Simpers","doi":"10.1039/D5DD00090D","DOIUrl":null,"url":null,"abstract":"<p >Large Language Models (LLMs) have the potential to revolutionize scientific research, yet their robustness and reliability in domain-specific applications remain insufficiently explored. In this study, we evaluate the performance and robustness of LLMs for materials science, focusing on domain-specific question answering and materials property prediction across diverse real-world and adversarial conditions. Three distinct datasets are used in this study: (1) a set of multiple-choice questions from undergraduate-level materials science courses, (2) a dataset including various steel compositions and yield strengths, and (3) a band gap dataset, containing textual descriptions of material crystal structures and band gap values. The performance of LLMs is assessed using various prompting strategies, including zero-shot chain-of-thought, expert prompting, and few-shot in-context learning. The robustness of these models is tested against various forms of “noise”, ranging from realistic disturbances to intentionally adversarial manipulations, to evaluate their resilience and reliability under real-world conditions. Additionally, the study showcases unique phenomena of LLMs during predictive tasks, such as mode collapse behavior when the proximity of prompt examples is altered and performance recovery from train/test mismatch. The findings aim to provide informed skepticism for the broad use of LLMs in materials science and to inspire advancements that enhance their robustness and reliability for practical applications.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 6","pages":" 1612-1624"},"PeriodicalIF":6.2000,"publicationDate":"2025-05-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2025/dd/d5dd00090d?page=search","citationCount":"0","resultStr":"{\"title\":\"Evaluating the performance and robustness of LLMs in materials science Q&A and property predictions†\",\"authors\":\"Hongchen Wang, Kangming Li, Scott Ramsay, Yao Fehlis, Edward Kim and Jason Hattrick-Simpers\",\"doi\":\"10.1039/D5DD00090D\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p >Large Language Models (LLMs) have the potential to revolutionize scientific research, yet their robustness and reliability in domain-specific applications remain insufficiently explored. In this study, we evaluate the performance and robustness of LLMs for materials science, focusing on domain-specific question answering and materials property prediction across diverse real-world and adversarial conditions. Three distinct datasets are used in this study: (1) a set of multiple-choice questions from undergraduate-level materials science courses, (2) a dataset including various steel compositions and yield strengths, and (3) a band gap dataset, containing textual descriptions of material crystal structures and band gap values. The performance of LLMs is assessed using various prompting strategies, including zero-shot chain-of-thought, expert prompting, and few-shot in-context learning. The robustness of these models is tested against various forms of “noise”, ranging from realistic disturbances to intentionally adversarial manipulations, to evaluate their resilience and reliability under real-world conditions. 
Additionally, the study showcases unique phenomena of LLMs during predictive tasks, such as mode collapse behavior when the proximity of prompt examples is altered and performance recovery from train/test mismatch. The findings aim to provide informed skepticism for the broad use of LLMs in materials science and to inspire advancements that enhance their robustness and reliability for practical applications.</p>\",\"PeriodicalId\":72816,\"journal\":{\"name\":\"Digital discovery\",\"volume\":\" 6\",\"pages\":\" 1612-1624\"},\"PeriodicalIF\":6.2000,\"publicationDate\":\"2025-05-28\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://pubs.rsc.org/en/content/articlepdf/2025/dd/d5dd00090d?page=search\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Digital discovery\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://pubs.rsc.org/en/content/articlelanding/2025/dd/d5dd00090d\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"CHEMISTRY, MULTIDISCIPLINARY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Digital discovery","FirstCategoryId":"1085","ListUrlMain":"https://pubs.rsc.org/en/content/articlelanding/2025/dd/d5dd00090d","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"CHEMISTRY, MULTIDISCIPLINARY","Score":null,"Total":0}
Citations: 0

Abstract


Large Language Models (LLMs) have the potential to revolutionize scientific research, yet their robustness and reliability in domain-specific applications remain insufficiently explored. In this study, we evaluate the performance and robustness of LLMs for materials science, focusing on domain-specific question answering and materials property prediction across diverse real-world and adversarial conditions. Three distinct datasets are used in this study: (1) a set of multiple-choice questions from undergraduate-level materials science courses, (2) a dataset including various steel compositions and yield strengths, and (3) a band gap dataset, containing textual descriptions of material crystal structures and band gap values. The performance of LLMs is assessed using various prompting strategies, including zero-shot chain-of-thought, expert prompting, and few-shot in-context learning. The robustness of these models is tested against various forms of “noise”, ranging from realistic disturbances to intentionally adversarial manipulations, to evaluate their resilience and reliability under real-world conditions. Additionally, the study showcases unique phenomena of LLMs during predictive tasks, such as mode collapse behavior when the proximity of prompt examples is altered and performance recovery from train/test mismatch. The findings aim to provide informed skepticism for the broad use of LLMs in materials science and to inspire advancements that enhance their robustness and reliability for practical applications.
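The abstract does not give the prompt-construction details, so the following is a minimal sketch, under stated assumptions, of how a few-shot in-context prompt for steel yield-strength prediction and a perturbed ("noisy") variant for robustness testing might be assembled. The `call_llm` helper, the example compositions, and the shuffling perturbation are illustrative placeholders, not taken from the paper.

```python
# Minimal sketch of a few-shot prompt for steel yield-strength prediction,
# plus a perturbed variant for robustness testing. All names and example
# values below are illustrative placeholders, not taken from the paper.

import random

def build_few_shot_prompt(examples, query_composition):
    """Assemble an in-context-learning prompt from (composition, strength) pairs."""
    lines = ["Predict the yield strength (MPa) of a steel from its composition."]
    for composition, strength in examples:
        lines.append(f"Composition: {composition} -> Yield strength: {strength} MPa")
    lines.append(f"Composition: {query_composition} -> Yield strength:")
    return "\n".join(lines)

def add_noise(prompt, seed=0):
    """Toy 'realistic disturbance': shuffle the order of the in-context examples."""
    random.seed(seed)
    header, *body, query = prompt.split("\n")
    random.shuffle(body)
    return "\n".join([header, *body, query])

def call_llm(prompt):
    # Hypothetical stub; wire this to whichever LLM API is being evaluated.
    raise NotImplementedError("connect to the LLM under evaluation")

if __name__ == "__main__":
    examples = [
        ("Fe-0.2C-1.5Mn (wt%)", 350),  # placeholder compositions and values
        ("Fe-0.4C-0.8Cr (wt%)", 520),
    ]
    clean = build_few_shot_prompt(examples, "Fe-0.3C-1.0Mn (wt%)")
    noisy = add_noise(clean)
    print(clean)
    print("---")
    print(noisy)
```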
