Investigating the error imbalance of large-scale machine learning potentials in catalysis

IF 4.4 · Zone 3 (Chemistry) · JCR Q2 (Chemistry, Physical)
Kareem Abdelmaqsoud, Muhammed Shuaibi, Adeesh Kolluru, Raffaele Cheula, John R. Kitchin
Journal: Catalysis Science & Technology. Published: 2024-08-14. DOI: 10.1039/d4cy00615a (https://doi.org/10.1039/d4cy00615a)
Citations: 0

Abstract

Machine learning potentials (MLPs) have greatly accelerated atomistic simulations for materials discovery. The Open Catalyst 2020 (OC20) dataset is one of the largest datasets for training MLPs for heterogeneous catalysis. Over the past two years, the mean absolute errors (MAEs) of the MLPs on the energy target of the dataset have asymptotically approached about 0.2 eV with increasingly sophisticated models. The errors were found to be imbalanced across material classes, with non-metals having the highest errors. In this work, we investigate several potential sources of this imbalance. First, we examined material class-specific convergence errors in the density functional theory (DFT) calculations, including k-point sampling, plane-wave cutoff, and smearing width. Significant DFT convergence errors, with a mean absolute value of ∼0.15 eV, were found on the total energies of non-metals, higher than for the other material classes. However, because of cancellation of errors, convergence errors on adsorption energies have a mean absolute value of ∼0.05 eV across all material classes. Moreover, we found that the MAEs of the MLPs are not affected by these convergence errors. Second, we show that calculations with surface reconstruction can introduce inconsistencies into the adsorption energy referencing scheme that cannot be fit by the MLPs. Non-metals and halides were found to have the highest fraction of calculations with surface reconstructions. Removing calculations with surface reconstructions from the validation sets, without re-training, significantly lowers the MAEs by ∼35% and reduces the imbalance of the MAEs. Alternatively, MLPs trained on total energies resolve the surface reconstruction inconsistencies, since they eliminate the referencing issue, and have MAEs comparable to those of MLPs trained on adsorption energies.
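
The two mechanisms described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code. It assumes the common convention in which an adsorption energy is the total energy of the adsorbate-plus-slab system minus the clean-slab energy and a gas-phase reference, and it uses a hypothetical `reconstructed` flag to mark calculations whose surface reconstructed. All numbers are toy values chosen for the illustration, not results from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Record:
    """One validation example (all fields are illustrative assumptions)."""
    material_class: str   # e.g. "non-metal", "intermetallic"
    e_ref: float          # reference DFT adsorption energy, eV
    e_pred: float         # MLP-predicted adsorption energy, eV
    reconstructed: bool   # hypothetical flag: surface reconstructed


def adsorption_energy(e_system: float, e_slab: float, e_gas: float) -> float:
    """Adsorption energy referenced to the clean slab and a gas-phase species."""
    return e_system - e_slab - e_gas


def mae_by_class(records, drop_reconstructed=False):
    """Mean absolute error per material class, optionally skipping flagged records."""
    abs_errors = defaultdict(list)
    for r in records:
        if drop_reconstructed and r.reconstructed:
            continue
        abs_errors[r.material_class].append(abs(r.e_pred - r.e_ref))
    return {cls: sum(errs) / len(errs) for cls, errs in abs_errors.items()}


# 1) Cancellation of errors: a convergence error `delta` that shifts the
#    system and slab total energies equally drops out of the difference.
delta = 0.125  # toy convergence error on total energies, eV
e_exact = adsorption_energy(-100.5, -98.0, -1.5)
e_shifted = adsorption_energy(-100.5 + delta, -98.0 + delta, -1.5)

# 2) Error imbalance: toy records in which one reconstructed non-metal
#    calculation dominates the class MAE.
records = [
    Record("non-metal", -1.0, -0.75, False),
    Record("non-metal", -2.0, -1.0, True),
    Record("intermetallic", -0.5, -0.25, False),
    Record("intermetallic", -1.0, -0.875, False),
]
mae_all = mae_by_class(records)
mae_filtered = mae_by_class(records, drop_reconstructed=True)
```

In this toy setup `e_exact` and `e_shifted` are equal, which mirrors why convergence errors on total energies can be much larger than those on adsorption energies. Filtering on the flag without changing any prediction lowers the non-metal MAE and leaves the intermetallic MAE unchanged, which is the kind of validation-set comparison the abstract describes.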

Source journal: Catalysis Science & Technology (Chemistry, Physical)
CiteScore: 8.70
Self-citation rate: 6.00%
Articles published: 587
Review time: 1.5 months
About the journal: A multidisciplinary journal focusing on cutting-edge research across all fundamental science and technological aspects of catalysis. Editor-in-chief: Bert Weckhuysen. Impact factor: 5.0. Time to first decision (peer reviewed only): 31 days.