Transferability of datasets between Machine-Learning Interaction Potentials

Samuel P. Niblett, Panagiotis Kourtis, Ioan-Bogdan Magdău, Clare P. Grey, Gábor Csányi
{"title":"Transferability of datasets between Machine-Learning Interaction Potentials","authors":"Samuel P. Niblett, Panagiotis Kourtis, Ioan-Bogdan Magdău, Clare P. Grey, Gábor Csányi","doi":"arxiv-2409.05590","DOIUrl":null,"url":null,"abstract":"With the emergence of Foundational Machine Learning Interatomic Potential\n(FMLIP) models trained on extensive datasets, transferring data between\ndifferent ML architectures has become increasingly important. In this work, we\nexamine the extent to which training data optimised for one machine-learning\nforcefield algorithm may be re-used to train different models, aiming to\naccelerate FMLIP fine-tuning and to reduce the need for costly iterative\ntraining. As a test case, we train models of an organic liquid mixture that is\ncommonly used as a solvent in rechargeable battery electrolytes, making it an\nimportant target for reactive MLIP development. We assess model performance by\nanalysing the properties of molecular dynamics trajectories, showing that this\nis a more stringent test than comparing prediction errors for fixed datasets.\nWe consider several types of training data, and several popular MLIPs - notably\nthe recent MACE architecture, a message-passing neural network designed for\nhigh efficiency and smoothness. We demonstrate that simple training sets\nconstructed without any ab initio dynamics are sufficient to produce stable\nmodels of molecular liquids. For simple neural-network architectures, further\niterative training is required to capture thermodynamic and kinetic properties\ncorrectly, but MACE performs well with extremely limited datsets. We find that\nconfigurations designed by human intuition to correct systematic model\ndeficiencies transfer effectively between algorithms, but active-learned data\nthat are generated by one MLIP do not typically benefit a different algorithm.\nFinally, we show that any training data which improve model performance also\nimprove its ability to generalise to similar unseen molecules. This suggests\nthat trajectory failure modes are connected with chemical structure rather than\nbeing entirely system-specific.","PeriodicalId":501304,"journal":{"name":"arXiv - PHYS - Chemical Physics","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - PHYS - Chemical Physics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.05590","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

With the emergence of Foundational Machine Learning Interatomic Potential (FMLIP) models trained on extensive datasets, transferring data between different ML architectures has become increasingly important. In this work, we examine the extent to which training data optimised for one machine-learning forcefield algorithm may be re-used to train different models, aiming to accelerate FMLIP fine-tuning and to reduce the need for costly iterative training. As a test case, we train models of an organic liquid mixture that is commonly used as a solvent in rechargeable battery electrolytes, making it an important target for reactive MLIP development. We assess model performance by analysing the properties of molecular dynamics trajectories, showing that this is a more stringent test than comparing prediction errors for fixed datasets. We consider several types of training data, and several popular MLIPs - notably the recent MACE architecture, a message-passing neural network designed for high efficiency and smoothness. We demonstrate that simple training sets constructed without any ab initio dynamics are sufficient to produce stable models of molecular liquids. For simple neural-network architectures, further iterative training is required to capture thermodynamic and kinetic properties correctly, but MACE performs well with extremely limited datasets. We find that configurations designed by human intuition to correct systematic model deficiencies transfer effectively between algorithms, but active-learned data that are generated by one MLIP do not typically benefit a different algorithm. Finally, we show that any training data which improve model performance also improve its ability to generalise to similar unseen molecules. This suggests that trajectory failure modes are connected with chemical structure rather than being entirely system-specific.
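
The evaluation described in the abstract is trajectory-based: rather than reporting errors on a held-out dataset, each model is judged by whether molecular dynamics driven by it reproduces the liquid's properties. Below is a minimal sketch of such a test using ASE with a MACE calculator. The file names (ec_emc_liquid.xyz, mace_model.model), run length, and thermostat settings are illustrative assumptions, not details taken from the paper.

# Minimal sketch of a trajectory-based evaluation: run NVT molecular dynamics
# with a trained MACE potential and save the trajectory for later analysis
# (density, radial distribution functions, diffusion coefficients).
# File names and run parameters are illustrative assumptions.
from ase import units
from ase.io import read
from ase.io.trajectory import Trajectory
from ase.md.langevin import Langevin
from mace.calculators import MACECalculator

atoms = read("ec_emc_liquid.xyz")  # hypothetical starting box of the solvent mixture
atoms.calc = MACECalculator(model_paths="mace_model.model", device="cuda")

dyn = Langevin(
    atoms,
    timestep=1.0 * units.fs,
    temperature_K=300,
    friction=0.01,  # Langevin friction in ASE internal units
)
dyn.attach(Trajectory("md.traj", "w", atoms).write, interval=100)
dyn.run(50_000)  # 50 ps; trajectory properties are then compared with reference data

Comparing quantities such as the density or radial distribution functions from the saved trajectory against ab initio or experimental references exposes instabilities and systematic biases that fixed-dataset error metrics can miss, which is why the authors treat this as the more stringent test.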