Lessons from the trenches on evaluating machine learning systems in materials science

Impact Factor 3.3 · JCR Q2, Materials Science, Multidisciplinary · CAS Tier 3 (Materials Science)
Nawaf Alampara, Mara Schilling-Wilhelmi, Kevin Maik Jablonka
{"title":"Lessons from the trenches on evaluating machine learning systems in materials science","authors":"Nawaf Alampara ,&nbsp;Mara Schilling-Wilhelmi ,&nbsp;Kevin Maik Jablonka","doi":"10.1016/j.commatsci.2025.114041","DOIUrl":null,"url":null,"abstract":"<div><div>Measurements are fundamental to knowledge creation in science, enabling consistent sharing of findings and serving as the foundation for scientific discovery. As machine learning systems increasingly transform scientific fields, the question of how to effectively evaluate these systems becomes crucial for ensuring reliable progress.</div><div>In this review, we examine the current state and future directions of evaluation frameworks for machine learning in science. We organize the review around a broadly applicable framework for evaluating machine learning systems through the lens of statistical measurement theory, using materials science as our primary context for examples and case studies. We identify key challenges common across machine learning evaluation such as construct validity, data quality issues, metric design limitations, and benchmark maintenance problems that can lead to phantom progress when evaluation frameworks fail to capture real-world performance needs.</div><div>By examining both traditional benchmarks and emerging evaluation approaches, we demonstrate how evaluation choices fundamentally shape not only our measurements but also research priorities and scientific progress. These findings reveal the critical need for transparency in evaluation design and reporting, leading us to propose evaluation cards as a structured approach to documenting measurement choices and limitations.</div><div>Our work highlights the importance of developing a more diverse toolbox of evaluation techniques for machine learning in materials science, while offering insights that can inform evaluation practices in other scientific domains where similar challenges exist.</div></div>","PeriodicalId":10650,"journal":{"name":"Computational Materials Science","volume":"259 ","pages":"Article 114041"},"PeriodicalIF":3.3000,"publicationDate":"2025-07-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computational Materials Science","FirstCategoryId":"88","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0927025625003842","RegionNum":3,"RegionCategory":"材料科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"MATERIALS SCIENCE, MULTIDISCIPLINARY","Score":null,"Total":0}
Citations: 0

Abstract

Measurements are fundamental to knowledge creation in science, enabling consistent sharing of findings and serving as the foundation for scientific discovery. As machine learning systems increasingly transform scientific fields, the question of how to effectively evaluate these systems becomes crucial for ensuring reliable progress.
In this review, we examine the current state and future directions of evaluation frameworks for machine learning in science. We organize the review around a broadly applicable framework for evaluating machine learning systems through the lens of statistical measurement theory, using materials science as our primary context for examples and case studies. We identify key challenges common across machine learning evaluation, such as threats to construct validity, data quality issues, metric design limitations, and benchmark maintenance problems, which can lead to phantom progress when evaluation frameworks fail to capture real-world performance needs.
By examining both traditional benchmarks and emerging evaluation approaches, we demonstrate how evaluation choices fundamentally shape not only our measurements but also research priorities and scientific progress. These findings reveal the critical need for transparency in evaluation design and reporting, leading us to propose evaluation cards as a structured approach to documenting measurement choices and limitations.
Our work highlights the importance of developing a more diverse toolbox of evaluation techniques for machine learning in materials science, while offering insights that can inform evaluation practices in other scientific domains where similar challenges exist.
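
The evaluation cards proposed in the review are a documentation artifact rather than code; still, a minimal sketch of how such a card could be captured as a structured record may help make the idea concrete. The Python dataclass below, with field names such as construct, metrics, and known_limitations, is an illustrative assumption, not the schema defined by the authors.

from dataclasses import dataclass, field, asdict
import json

# Hypothetical sketch of an "evaluation card" as a structured record.
# Field names are illustrative assumptions, not the schema from the paper.
@dataclass
class EvaluationCard:
    task: str                      # real-world capability the evaluation targets
    construct: str                 # quantity the evaluation is intended to measure
    metrics: list[str]             # metrics reported, e.g. MAE, accuracy
    data_sources: list[str]        # provenance of the evaluation data
    known_limitations: list[str] = field(default_factory=list)
    maintenance_plan: str = "unmaintained"  # how and when the benchmark is updated

    def to_json(self) -> str:
        """Serialize the card so it can be shipped alongside benchmark results."""
        return json.dumps(asdict(self), indent=2)

# Example usage for a hypothetical formation-energy benchmark.
card = EvaluationCard(
    task="Screening stable inorganic crystals",
    construct="Generalization of formation-energy predictions to unseen chemistries",
    metrics=["MAE (eV/atom)"],
    data_sources=["DFT relaxations (hypothetical snapshot)"],
    known_limitations=["Train/test overlap in prototype structures not audited"],
)
print(card.to_json())

Publishing such a record alongside benchmark results would make the measurement choices and their limitations explicit, which is the transparency goal the review argues for.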


Source journal: Computational Materials Science (Engineering & Technology: Materials Science, Multidisciplinary)
CiteScore: 6.50
Self-citation rate: 6.10%
Articles published per year: 665
Review time: 26 days
Journal description: The goal of Computational Materials Science is to report on results that provide new or unique insights into, or significantly expand our understanding of, the properties of materials or phenomena associated with their design, synthesis, processing, characterization, and utilization. To be relevant to the journal, the results should be applied or applicable to specific material systems that are discussed within the submission.