MatFold: systematic insights into materials discovery models' performance through standardized cross-validation protocols†

IF 6.2 Q1 CHEMISTRY, MULTIDISCIPLINARY
Matthew D. Witman and Peter Schindler
Journal: Digital Discovery, vol. 3, pp. 625-635. Published 2024-12-09. DOI: 10.1039/D4DD00250D
Article: https://pubs.rsc.org/en/content/articlelanding/2025/dd/d4dd00250d
PDF: https://pubs.rsc.org/en/content/articlepdf/2025/dd/d4dd00250d?page=search

Abstract

Machine learning (ML) models in the materials sciences that are validated with overly simplistic cross-validation (CV) protocols can yield biased performance estimates for downstream modeling or materials screening tasks. This is particularly counterproductive in applications where the time and cost of failed validation efforts (experimental synthesis, characterization, and testing) are significant. We propose a set of standardized, increasingly difficult splitting protocols for chemically and structurally motivated CV that can be followed to validate any ML model for materials discovery. Among several benefits, these protocols enable systematic insights into model generalizability, improvability, and uncertainty; provide benchmarks for fair comparison between competing models with access to differing quantities of data; and systematically reduce possible data leakage through increasingly strict splitting criteria. Thorough CV investigations across increasingly strict chemical/structural splitting criteria, local vs. global property prediction tasks, small vs. large datasets, and structure vs. compositional model architectures reveal some common threads; however, several marked differences exist across these exemplars, indicating the need for comprehensive analysis to fully understand each model's generalization accuracy and potential for materials discovery. To this end, we provide a general-purpose, featurization-agnostic toolkit, MatFold, that automates reproducible construction of these CV splits, and we encourage further community use in model benchmarking.
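
To make the idea of increasingly strict, chemically motivated splits concrete, the following is a minimal sketch in plain Python. It is not MatFold's API: the function names (`elements`, `group_key`, `group_k_fold`), the three criteria shown ("none", "composition", "chemsys"), and the round-robin fold assignment are all assumptions chosen for illustration. The point is only that every sample sharing a group key lands in the same fold, so stricter keys leave less chemical overlap between training and test sets.

```python
import re
from collections import defaultdict


def elements(formula):
    """Sorted, unique element symbols in a simple formula such as 'Fe2O3'."""
    return tuple(sorted(set(re.findall(r"[A-Z][a-z]?", formula))))


def group_key(formula, criterion, index):
    """Hypothetical grouping keys, ordered from least to most strict."""
    if criterion == "none":         # every sample is its own group (plain K-fold)
        return str(index)
    if criterion == "composition":  # identical formulas are never split apart
        return formula
    if criterion == "chemsys":      # same set of elements is never split apart
        return "-".join(elements(formula))
    raise ValueError(f"unknown criterion: {criterion}")


def group_k_fold(formulas, criterion, k):
    """Return k (train_indices, test_indices) pairs with whole groups held out."""
    groups = defaultdict(list)
    for i, formula in enumerate(formulas):
        groups[group_key(formula, criterion, i)].append(i)

    # Deterministic assignment: largest groups first, dealt round-robin to folds.
    ordered = sorted(groups, key=lambda g: (-len(groups[g]), g))
    folds = [[] for _ in range(k)]
    for j, key in enumerate(ordered):
        folds[j % k].extend(groups[key])

    all_indices = set(range(len(formulas)))
    return [(sorted(all_indices - set(test)), sorted(test)) for test in folds]


if __name__ == "__main__":
    data = ["Fe2O3", "FeO", "Fe3O4", "TiO2", "Ti2O3", "NaCl"]
    for criterion in ("none", "composition", "chemsys"):
        print(criterion, group_k_fold(data, criterion, k=2))
```

Under the "chemsys" criterion, no chemical system in a test fold also appears in the corresponding training fold, which is the kind of leakage reduction the abstract describes. A real workflow would likely parse formulas with a dedicated library and add structural criteria, which this sketch omits.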

