Stefan Heinen, Danish Khan, Guido Falk von Rudorff, Konstantin Karandashev, Daniel Jose Arismendi Arrieta, Alastair J A Price, Surajit Nandi, Arghya Bhowmik, Kersti Hermansson, O Anatole von Lilienfeld
{"title":"Reducing training data needs with minimal multilevel machine learning (M3L)","authors":"Stefan Heinen, Danish Khan, Guido Falk von Rudorff, Konstantin Karandashev, Daniel Jose Arismendi Arrieta, Alastair J A Price, Surajit Nandi, Arghya Bhowmik, Kersti Hermansson, O Anatole von Lilienfeld","doi":"10.1088/2632-2153/ad4ae5","DOIUrl":null,"url":null,"abstract":"For many machine learning applications in science, data acquisition, not training, is the bottleneck even when avoiding experiments and relying on computation and simulation. Correspondingly, and in order to reduce cost and carbon footprint, training data efficiency is key. We introduce minimal multilevel machine learning (M3L) which optimizes training data set sizes using a loss function at multiple levels of reference data in order to minimize a combination of prediction error with overall training data acquisition costs (as measured by computational wall-times). Numerical evidence has been obtained for calculated atomization energies and electron affinities of thousands of organic molecules at various levels of theory including HF, MP2, DLPNO-CCSD(T), DFHFCABS, PNOMP2F12, and PNOCCSD(T)F12, and treating them with basis sets TZ, cc-pVTZ, and AVTZ-F12. Our M3L benchmarks for reaching chemical accuracy in distinct chemical compound sub-spaces indicate substantial computational cost reductions by factors of ∼1.01, 1.1, 3.8, 13.8, and 25.8 when compared to heuristic sub-optimal multilevel machine learning (M2L) for the data sets QM7b, QM9<inline-formula>\n<tex-math><?CDATA $^\\mathrm{LCCSD(T)}$?></tex-math>\n<mml:math overflow=\"scroll\"><mml:mrow><mml:msup><mml:mi></mml:mi><mml:mrow><mml:mi>LCCSD</mml:mi><mml:mo stretchy=\"false\">(</mml:mo><mml:mi mathvariant=\"normal\">T</mml:mi><mml:mo stretchy=\"false\">)</mml:mo></mml:mrow></mml:msup></mml:mrow></mml:math>\n<inline-graphic xlink:href=\"mlstad4ae5ieqn1.gif\" xlink:type=\"simple\"></inline-graphic>\n</inline-formula>, Electrolyte Genome Project, QM9<inline-formula>\n<tex-math><?CDATA $^\\mathrm{CCSD(T)}_\\mathrm{AE}$?></tex-math>\n<mml:math overflow=\"scroll\"><mml:mrow><mml:msubsup><mml:mi></mml:mi><mml:mrow><mml:mi>AE</mml:mi></mml:mrow><mml:mrow><mml:mi>CCSD</mml:mi><mml:mo stretchy=\"false\">(</mml:mo><mml:mi mathvariant=\"normal\">T</mml:mi><mml:mo stretchy=\"false\">)</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:math>\n<inline-graphic xlink:href=\"mlstad4ae5ieqn2.gif\" xlink:type=\"simple\"></inline-graphic>\n</inline-formula>, and QM9<inline-formula>\n<tex-math><?CDATA $^\\mathrm{CCSD(T)}_\\mathrm{EA}$?></tex-math>\n<mml:math overflow=\"scroll\"><mml:mrow><mml:msubsup><mml:mi></mml:mi><mml:mrow><mml:mi>EA</mml:mi></mml:mrow><mml:mrow><mml:mi>CCSD</mml:mi><mml:mo stretchy=\"false\">(</mml:mo><mml:mi mathvariant=\"normal\">T</mml:mi><mml:mo stretchy=\"false\">)</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:math>\n<inline-graphic xlink:href=\"mlstad4ae5ieqn3.gif\" xlink:type=\"simple\"></inline-graphic>\n</inline-formula>, respectively. Furthermore, we use M2L to investigate the performance for 76 density functionals when used within multilevel learning and building on the following levels drawn from the hierarchy of Jacobs Ladder: LDA, GGA, mGGA, and hybrid functionals. Within M2L and the molecules considered, mGGAs do not provide any noticeable advantage over GGAs. Among the functionals considered and in combination with LDA, the three on average top performing GGA and Hybrid levels for atomization energies on QM9 using M3L correspond respectively to PW91, KT2, B97D, and <italic toggle=\"yes\">τ</italic>-HCTH, B3LYP<inline-formula>\n<tex-math><?CDATA $\\ast$?></tex-math>\n<mml:math overflow=\"scroll\"><mml:mrow><mml:mo>∗</mml:mo></mml:mrow></mml:math>\n<inline-graphic xlink:href=\"mlstad4ae5ieqn4.gif\" xlink:type=\"simple\"></inline-graphic>\n</inline-formula>(VWN5), and TPSSH.","PeriodicalId":33757,"journal":{"name":"Machine Learning Science and Technology","volume":null,"pages":null},"PeriodicalIF":6.3000,"publicationDate":"2024-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Machine Learning Science and Technology","FirstCategoryId":"101","ListUrlMain":"https://doi.org/10.1088/2632-2153/ad4ae5","RegionNum":2,"RegionCategory":"物理与天体物理","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0
Abstract
For many machine learning applications in science, data acquisition, not training, is the bottleneck even when avoiding experiments and relying on computation and simulation. Correspondingly, and in order to reduce cost and carbon footprint, training data efficiency is key. We introduce minimal multilevel machine learning (M3L) which optimizes training data set sizes using a loss function at multiple levels of reference data in order to minimize a combination of prediction error with overall training data acquisition costs (as measured by computational wall-times). Numerical evidence has been obtained for calculated atomization energies and electron affinities of thousands of organic molecules at various levels of theory including HF, MP2, DLPNO-CCSD(T), DFHFCABS, PNOMP2F12, and PNOCCSD(T)F12, and treating them with basis sets TZ, cc-pVTZ, and AVTZ-F12. Our M3L benchmarks for reaching chemical accuracy in distinct chemical compound sub-spaces indicate substantial computational cost reductions by factors of ∼1.01, 1.1, 3.8, 13.8, and 25.8 when compared to heuristic sub-optimal multilevel machine learning (M2L) for the data sets QM7b, QM9LCCSD(T), Electrolyte Genome Project, QM9AECCSD(T), and QM9EACCSD(T), respectively. Furthermore, we use M2L to investigate the performance for 76 density functionals when used within multilevel learning and building on the following levels drawn from the hierarchy of Jacobs Ladder: LDA, GGA, mGGA, and hybrid functionals. Within M2L and the molecules considered, mGGAs do not provide any noticeable advantage over GGAs. Among the functionals considered and in combination with LDA, the three on average top performing GGA and Hybrid levels for atomization energies on QM9 using M3L correspond respectively to PW91, KT2, B97D, and τ-HCTH, B3LYP∗(VWN5), and TPSSH.
期刊介绍:
Machine Learning Science and Technology is a multidisciplinary open access journal that bridges the application of machine learning across the sciences with advances in machine learning methods and theory as motivated by physical insights. Specifically, articles must fall into one of the following categories: advance the state of machine learning-driven applications in the sciences or make conceptual, methodological or theoretical advances in machine learning with applications to, inspiration from, or motivated by scientific problems.