Improving machine-learning models in materials science through large datasets

Impact Factor 10.0 · JCR Q1, Materials Science, Multidisciplinary · CAS Region 2 (Materials Science)
Jonathan Schmidt, Tiago F.T. Cerqueira, Aldo H. Romero, Antoine Loew, Fabian Jäger, Hai-Chen Wang, Silvana Botti, Miguel A.L. Marques
Journal: Materials Today Physics, Volume 48, Article 101560
DOI: 10.1016/j.mtphys.2024.101560
Published: 2024-09-25 (Journal Article)
URL: https://www.sciencedirect.com/science/article/pii/S2542529324002360
Citations: 0

Abstract

The accuracy of a machine learning model is limited by the quality and quantity of the data available for its training and validation. This problem is particularly challenging in materials science, where large, high-quality, and consistent datasets are scarce. Here we present alexandria, an open database of more than 5 million density-functional theory calculations for periodic three-, two-, and one-dimensional compounds. We use this data to train machine learning models to reproduce seven different properties, using both composition-based models and crystal-graph neural networks. In the majority of cases, the error of the models decreases monotonically with the training data, although some graph networks seem to saturate for large training set sizes. Differences in the training can be correlated with the statistical distribution of the different properties. We also observe that graph networks, which have access to detailed geometrical information, generally yield more accurate models than simple composition-based methods. Finally, we assess several universal machine learning interatomic potentials. Crystal geometries optimised with these force fields are of very high quality, but unfortunately the accuracy of the energies is still lacking. Furthermore, we observe some instabilities for regions of chemical space that are undersampled in the training sets used for these models. This study highlights the potential of large-scale, high-quality datasets to improve machine learning models in materials science.
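The monotonic decrease of model error with training-set size described above is commonly summarised by fitting a power-law learning curve, MAE ≈ a·N^(−b), which is linear in log-log space. The sketch below illustrates this with entirely synthetic, made-up error values (not numbers from the paper), using only NumPy:

```python
import numpy as np

# Hypothetical learning-curve data, for illustration only:
# mean absolute error of a model as the training-set size N grows.
train_sizes = np.array([1e3, 1e4, 1e5, 1e6, 5e6])
mae = np.array([0.120, 0.068, 0.039, 0.022, 0.015])  # eV/atom, synthetic

# Fit MAE ≈ a * N^(-b) by linear regression in log-log space.
slope, intercept = np.polyfit(np.log(train_sizes), np.log(mae), 1)
a, b = np.exp(intercept), -slope

print(f"fitted exponent b = {b:.3f}")
# Extrapolate the error to N = 1e7 under the power-law assumption.
print(f"predicted MAE at 1e7 samples: {a * 1e7 ** (-b):.4f}")
```

Saturation of the kind observed for some graph networks would show up as a systematic flattening of the curve above this power-law fit at large N.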

Source journal: Materials Today Physics (Materials Science — General Materials Science)
CiteScore: 14.00
Self-citation rate: 7.80%
Articles published: 284
Review time: 15 days
Journal description: Materials Today Physics is a multidisciplinary journal focused on the physics of materials, encompassing both physical properties and materials synthesis. Operating at the interface of physics and materials science, the journal covers one of the largest and most dynamic fields within physical science. Forefront research in materials physics is driving advances in new materials, uncovering new physics, and fostering novel applications at an unprecedented pace.