Lessons from the trenches on evaluating machine learning systems in materials science
Nawaf Alampara, Mara Schilling-Wilhelmi, Kevin Maik Jablonka
Computational Materials Science, Volume 259, Article 114041 (published 2025-07-30). DOI: 10.1016/j.commatsci.2025.114041
https://www.sciencedirect.com/science/article/pii/S0927025625003842
Citations: 0
Abstract
Measurements are fundamental to knowledge creation in science, enabling consistent sharing of findings and serving as the foundation for scientific discovery. As machine learning systems increasingly transform scientific fields, the question of how to effectively evaluate these systems becomes crucial for ensuring reliable progress.
In this review, we examine the current state and future directions of evaluation frameworks for machine learning in science. We organize the review around a broadly applicable framework for evaluating machine learning systems through the lens of statistical measurement theory, using materials science as our primary context for examples and case studies. We identify key challenges common across machine learning evaluation, including threats to construct validity, data quality issues, metric design limitations, and benchmark maintenance problems, all of which can lead to phantom progress when evaluation frameworks fail to capture real-world performance needs.
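To make one of these pitfalls concrete, the short sketch below is an illustration that is not taken from the paper: the material families, error values, and the band-gap setting are all assumptions. It shows how a single aggregate metric can mask a subgroup on which a model performs poorly, one route to the phantom progress described above; reporting per-subgroup scores alongside the aggregate exposes the gap.

```python
import statistics

# Hypothetical absolute errors (eV) for a band-gap model on two material
# families; the names and numbers are illustrative only.
errors = {
    "oxides":   [0.10, 0.12, 0.09, 0.11],
    "sulfides": [0.45, 0.50, 0.48, 0.52],
}

# The aggregate MAE averages away the fact that one family is served far worse.
all_errors = [e for family_errors in errors.values() for e in family_errors]
print(f"aggregate MAE: {statistics.mean(all_errors):.3f} eV")

# Per-subgroup reporting makes the disparity visible.
for family, errs in errors.items():
    print(f"{family:>8} MAE: {statistics.mean(errs):.3f} eV")
```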
By examining both traditional benchmarks and emerging evaluation approaches, we demonstrate how evaluation choices fundamentally shape not only our measurements but also research priorities and scientific progress. These findings reveal the critical need for transparency in evaluation design and reporting, leading us to propose evaluation cards as a structured approach to documenting measurement choices and limitations.
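The abstract does not spell out what an evaluation card contains, so the following is a minimal hypothetical sketch of what such a structured record could capture; the field names (benchmark_name, construct, metric, data_provenance, known_limitations) and the example values are assumptions, not the schema proposed in the paper.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of an "evaluation card" record; fields are illustrative
# assumptions, not the schema from the paper.
@dataclass
class EvaluationCard:
    benchmark_name: str                 # which benchmark the card documents
    construct: str                      # what the evaluation intends to measure
    metric: str                         # how performance is scored, with units
    data_provenance: str                # where the test data came from
    known_limitations: list[str] = field(default_factory=list)

card = EvaluationCard(
    benchmark_name="toy-band-gap-benchmark",  # hypothetical name
    construct="generalization to unseen crystal structures",
    metric="MAE (eV)",
    data_provenance="DFT-computed labels; curation steps documented separately",
    known_limitations=[
        "test set overlaps chemically with common training corpora",
        "a single metric may not reflect downstream screening utility",
    ],
)
print(card)
```

Keeping such a card machine-readable would let the documented measurement choices travel with benchmark results rather than living only in a paper's methods section.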
Our work highlights the importance of developing a more diverse toolbox of evaluation techniques for machine learning in materials science, while offering insights that can inform evaluation practices in other scientific domains where similar challenges exist.
Journal overview:
The goal of Computational Materials Science is to report on results that provide new or unique insights into, or significantly expand our understanding of, the properties of materials or phenomena associated with their design, synthesis, processing, characterization, and utilization. To be relevant to the journal, the results should be applied or applicable to specific material systems that are discussed within the submission.