PAL – parallel active learning for machine-learned potentials

Impact factor: 6.2; JCR quartile: Q1; Category: Chemistry, Multidisciplinary

Digital Discovery; Pub Date: 2025-06-22; DOI: 10.1039/D5DD00073D

Chen Zhou, Marlen Neubert, Yuri Koide, Yumeng Zhang, Van-Quan Vuong, Tobias Schlöder, Stefanie Dehnen and Pascal Friederich

Pages: 1901-1911

Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12188519/pdf/

Article page: https://pubs.rsc.org/en/content/articlelanding/2025/dd/d5dd00073d

Citations: 0

Abstract



Constructing datasets representative of the target domain is essential for training effective machine learning models. Active learning (AL) is a promising method that iteratively extends training data to enhance model performance while minimizing data acquisition costs. However, current AL workflows often require human intervention and lack parallelism, leading to inefficiencies and underutilization of modern computational resources. In this work, we introduce PAL, an automated, modular, and parallel active learning library that integrates AL tasks and manages their execution and communication on shared- and distributed-memory systems using the Message Passing Interface (MPI). PAL provides users with the flexibility to design and customize all components of their active learning scenarios, including machine learning models with uncertainty estimation, oracles for ground truth labeling, and strategies for exploring the target space. We demonstrate that PAL significantly reduces computational overhead and improves scalability, achieving substantial speed-ups through asynchronous parallelization on CPU and GPU hardware. Applications of PAL to several real-world scenarios – including ground-state reactions in biomolecular systems, excited-state dynamics of molecules, simulations of inorganic clusters, and thermo-fluid dynamics – illustrate its effectiveness in accelerating the development of machine learning models. Our results show that PAL enables efficient utilization of high-performance computing resources in active learning workflows, fostering advancements in scientific research and engineering applications.
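The three components named in the abstract (a model with uncertainty estimation, an oracle for ground-truth labels, and a strategy for exploring the target space) can be illustrated with a minimal serial sketch. Everything below is an assumption made for illustration: the function names, the toy 1D oracle, and the jackknife committee are hypothetical and are not PAL's API, and the loop runs sequentially rather than through MPI.

```python
import math
import statistics


def oracle(x):
    """Hypothetical stand-in for an expensive ground-truth calculation."""
    return math.sin(3.0 * x)


def interpolate(points, x):
    """Piecewise-linear prediction from sorted (x, y) pairs, clamped at the ends."""
    if x <= points[0][0]:
        return points[0][1]
    if x >= points[-1][0]:
        return points[-1][1]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= x <= x1:
            t = (x - x0) / (x1 - x0)
            return y0 + t * (y1 - y0)
    raise ValueError("x is not covered by the training points")


def committee_predict(labeled, x):
    """Jackknife committee: each member is trained with one labeled point left out.

    Returns the mean prediction and the spread across members, which serves
    as a simple uncertainty estimate.
    """
    points = sorted(labeled.items())
    members = [points[:i] + points[i + 1:] for i in range(len(points))]
    predictions = [interpolate(member, x) for member in members]
    return statistics.mean(predictions), statistics.pstdev(predictions)


def active_learning(candidates, labeled, rounds):
    """Each round, label the candidate on which the committee disagrees most."""
    for _ in range(rounds):
        pool = [x for x in candidates if x not in labeled]
        if not pool:
            break
        query = max(pool, key=lambda x: committee_predict(labeled, x)[1])
        labeled[query] = oracle(query)  # the only call to the expensive oracle
    return labeled


candidates = [i / 20 for i in range(41)]           # grid on [0, 2]
labeled = {x: oracle(x) for x in (0.0, 1.0, 2.0)}  # small initial training set
labeled = active_learning(candidates, labeled, rounds=5)
print(sorted(labeled))
```

In this sketch prediction, selection, and labeling happen one after another in a single process. According to the abstract, PAL instead runs these tasks asynchronously and in parallel, coordinating them via MPI.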


Source journal

Digital Discovery

CiteScore

2.80

Self-citation rate

0.00%

Publication count