Maksim Kulichenko, Benjamin Nebgen, Nicholas Lubbers, Justin S. Smith, Kipton Barros, Alice E. A. Allen, Adela Habib, Emily Shinkle, Nikita Fedik, Ying Wai Li, Richard A. Messerly, Sergei Tretiak
Data Generation for Machine Learning Interatomic Potentials and Beyond
Chemical Reviews, published 2024-11-21. DOI: 10.1021/acs.chemrev.4c00572 (https://doi.org/10.1021/acs.chemrev.4c00572)
Abstract
The field of data-driven chemistry is undergoing an evolution, driven by innovations in machine learning models for predicting molecular properties and behavior. Recent strides in ML-based interatomic potentials have paved the way for accurate modeling of diverse chemical and structural properties at the atomic level. The key determinant defining MLIP reliability remains the quality of the training data. A paramount challenge lies in constructing training sets that capture specific domains in the vast chemical and structural space. This Review navigates the intricate landscape of essential components and integrity of training data that ensure the extensibility and transferability of the resulting models. We delve into the details of active learning, discussing its various facets and implementations. We outline different types of uncertainty quantification applied to atomistic data acquisition and the correlations between estimated uncertainty and true error. The role of atomistic data samplers in generating diverse and informative structures is highlighted. Furthermore, we discuss data acquisition via modified and surrogate potential energy surfaces as an innovative approach to diversify training data. The Review also provides a list of publicly available data sets that cover essential domains of chemical space.
About the journal:
Chemical Reviews is a highly regarded and highest-ranked journal covering the general topic of chemistry. Its mission is to provide comprehensive, authoritative, critical, and readable reviews of important recent research in organic, inorganic, physical, analytical, theoretical, and biological chemistry.
Since 1985, Chemical Reviews has also published periodic thematic issues that focus on a single theme or direction of emerging research.