Data Generation for Machine Learning Interatomic Potentials and Beyond

Impact factor: 51.4 · CAS ranking: Tier 1 (Chemistry) · JCR: Q1 (Chemistry, Multidisciplinary)
Maksim Kulichenko, Benjamin Nebgen, Nicholas Lubbers, Justin S. Smith, Kipton Barros, Alice E. A. Allen, Adela Habib, Emily Shinkle, Nikita Fedik, Ying Wai Li, Richard A. Messerly, Sergei Tretiak
Journal: Chemical Reviews
Publication date: 2024-11-21
DOI: 10.1021/acs.chemrev.4c00572 (https://doi.org/10.1021/acs.chemrev.4c00572)

Abstract

The field of data-driven chemistry is evolving rapidly, driven by innovations in machine learning (ML) models for predicting molecular properties and behavior. Recent advances in ML-based interatomic potentials (MLIPs) have paved the way for accurate modeling of diverse chemical and structural properties at the atomic level. The key determinant of MLIP reliability remains the quality of the training data. A central challenge lies in constructing training sets that capture specific domains in the vast chemical and structural space. This Review surveys the essential components and integrity of training data that ensure the extensibility and transferability of the resulting models. We examine active learning in detail, discussing its various facets and implementations. We outline the different types of uncertainty quantification applied to atomistic data acquisition and the correlations between estimated uncertainty and true error. We highlight the role of atomistic data samplers in generating diverse and informative structures. Furthermore, we discuss data acquisition via modified and surrogate potential energy surfaces as an approach to diversifying training data. The Review also provides a list of publicly available data sets that cover essential domains of chemical space.

Source journal
Chemical Reviews (Chemistry, Multidisciplinary)
CiteScore: 106.00
Self-citation rate: 1.10%
Articles published: 278
Review time: 4.3 months
About the journal: Chemical Reviews is a highly regarded and highest-ranked journal covering the general topic of chemistry. Its mission is to provide comprehensive, authoritative, critical, and readable reviews of important recent research in organic, inorganic, physical, analytical, theoretical, and biological chemistry. Since 1985, Chemical Reviews has also published periodic thematic issues that focus on a single theme or direction of emerging research.