Automated generation of structure datasets for machine learning potentials and alloys
Authors: Marvin Poul, Liam Huber, Jörg Neugebauer
Journal: npj Computational Materials
Published: 2025-06-07
DOI: 10.1038/s41524-025-01669-4 (https://doi.org/10.1038/s41524-025-01669-4)
Citation count: 0
Abstract
We propose a strategy for generating unbiased and systematically extendable training data for machine learning interatomic potentials (MLIP) for multicomponent alloys, called Automated Small SYmmetric Structure Training or ASSYST. Based on exploring the full space of random crystal structures with space groups, it facilitates the construction of training sets for MLIPs in an automatic way without prior knowledge of the material in question. The advantages of this approach are that only cells consisting of few atoms (≈ 10) are needed for the DFT training set, and the size and completeness of the data set can be systematically controlled with very few parameters. We validate that potentials fitted this way can accurately describe a wide range of binary and ternary phases, random alloys, as well as point and extended defects, that have not been part of the training set. Finally, we estimate the binary phase diagrams with good experimental agreement. We demonstrate that the overall excellent performance is not a coincidence, but a consequence of the extensive sampling in phase space of ASSYST. Overall, this means that ASSYST will enable the largely autonomous generation of high-quality DFT reference data and MLIPs.
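The claim that the size and completeness of the data set follow from very few parameters can be illustrated with a small enumeration. The sketch below is not the ASSYST implementation. It only assumes that the search space is spanned by a list of elements, a maximum number of atoms per cell, and the 230 space groups. The function names and the placeholder element labels "A", "B", "C" are hypothetical. The sketch also ignores that many (space group, composition) pairs cannot be realized, because Wyckoff multiplicities restrict the atom counts a space group allows. The result is therefore an upper bound on the number of candidates.

```python
from itertools import product


def compositions(n_elements, max_atoms):
    """Yield every tuple of per-element atom counts with 1 <= total <= max_atoms."""
    for counts in product(range(max_atoms + 1), repeat=n_elements):
        if 1 <= sum(counts) <= max_atoms:
            yield counts


def sampling_grid(elements, max_atoms, space_groups=range(1, 231)):
    """Yield (space_group, stoichiometry) pairs spanning the assumed search space.

    Hypothetical helper: a real generator would additionally discard pairs
    that are incompatible with the Wyckoff positions of the space group.
    """
    for counts in compositions(len(elements), max_atoms):
        # Drop elements with zero atoms so unary and binary cells are included.
        stoichiometry = {el: n for el, n in zip(elements, counts) if n > 0}
        for sg in space_groups:
            yield sg, stoichiometry


# Placeholder ternary system with at most 10 atoms per cell.
grid = list(sampling_grid(["A", "B", "C"], max_atoms=10))
print(len(grid))  # 285 compositions x 230 space groups = 65550
```

Raising `max_atoms` or adding an element enlarges the grid in a predictable way, which is the sense in which a handful of parameters controls the extent of the sampling.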
About the journal:
npj Computational Materials is a high-quality open access journal from Nature Research that publishes research papers applying computational approaches for the design of new materials and enhancing our understanding of existing ones. The journal also welcomes papers on new computational techniques and the refinement of current approaches that support these aims, as well as experimental papers that complement computational findings.
Key metrics for npj Computational Materials include a 2-year impact factor of 12.241 (2021), 1,138,590 article downloads (2021), and a turnaround time of 11 days from submission to the first editorial decision. The journal is indexed in various databases and services, including Chemical Abstracts Service (CAS), Astrophysics Data System (ADS), Current Contents/Physical, Chemical and Earth Sciences, Journal Citation Reports/Science Edition, SCOPUS, EI Compendex, INSPEC, Google Scholar, SCImago, DOAJ, CNKI, and Science Citation Index Expanded (SCIE), among others.