Machine Learning Assisted Material Discovery: A Small Data Approach

IF 14.7 Q1 CHEMISTRY, MULTIDISCIPLINARY

Accounts of materials research Pub Date : 2025-05-22 DOI:10.1021/accountsmr.1c00236

Qionghua Zhou*, Xinyu Chen and Jinlan Wang*

Vol. 6, No. 6, pp. 685–694. Publication type: Journal Article.

Citations: 0

Abstract


The data-driven paradigm, exemplified by machine learning, is revolutionizing the way materials are discovered. The inductive nature of the data-driven approach gives it great speed of prediction but also brings a heavy reliance on materials data. However, unlike text and images, where its success is supported by big data, materials data tend to be small data. Building a large database of materials is a good solution but not a permanent one. The cost of materials data is much higher than that of text or images, and the size of materials databases at this stage is far from sufficient. We will continue to face a shortage of materials data for a long time to come, making small data approaches necessary for machine-learning-based materials discovery.

In this Account, we focus on small data strategies developed over the past few years and the scenarios in which they are used. In the first part, we discuss two general strategies, active learning and transfer learning, which are ways of adding new data efficiently and using existing data, respectively. The key to active learning is the sampling strategy, which determines the speed of convergence and the predictive range of the machine learning model. For transfer learning, adversarial training is introduced to extend the scope of this strategy, allowing for knowledge transfer across materials and properties. We also discuss other small data approaches for special cases, such as material search with zero initial data and model training on multisource experimental data. In the second part, we focus on the construction of material descriptors and reduction of their dimensionality. We have developed a crystal-graph-based descriptor specifically for two-dimensional materials. It can encode both structural and atomic information and also has a flexible multilayer format for different target properties. Since the dimensionality of the material descriptor is limited by the amount of data, specially designed dimensionality reduction strategies are also discussed. In the third part, we discuss model interpretability. Several examples are given to illustrate how model-based and data-based interpretation strategies can be used to help us understand the machine learning model and its prediction results.

The Account concludes with our perspectives on the latest developments in generative AI (in particular, large language models and diffusion models) and explainable AI, which could be powerful tools in the future of machine learning assisted material discovery.
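The abstract names the sampling strategy as the key to active learning. As a minimal sketch of one common strategy (uncertainty sampling), and not the authors' own method, the loop below repeatedly labels the candidate whose prediction is least certain. The Gaussian-process surrogate, the RBF kernel and its length scale, the one-dimensional candidate pool, and the `expensive_label` function are all illustrative assumptions that do not come from the Account.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=0.5):
    """Squared-exponential kernel between two 1-D arrays of inputs."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def gp_posterior(x_train, y_train, x_query, noise=1e-6):
    """Posterior mean and variance of a zero-mean GP with unit prior variance."""
    k_tt = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    k_qt = rbf_kernel(x_query, x_train)
    mean = k_qt @ np.linalg.solve(k_tt, y_train)
    var = 1.0 - np.sum(k_qt * np.linalg.solve(k_tt, k_qt.T).T, axis=1)
    return mean, var

def expensive_label(x):
    """Hypothetical stand-in for a costly calculation or experiment."""
    return np.sin(3.0 * x)

def active_learning(pool, initial_idx, n_queries):
    """Repeatedly label the pool candidate with the largest predictive variance."""
    labeled = list(initial_idx)
    for _ in range(n_queries):
        x_l = pool[labeled]
        y_l = expensive_label(x_l)
        _, var = gp_posterior(x_l, y_l, pool)
        var[labeled] = -np.inf  # never re-query a labeled candidate
        labeled.append(int(np.argmax(var)))
    return labeled

# Hypothetical pool of 61 candidate "materials" described by one feature.
pool = np.linspace(0.0, 3.0, 61)
labeled = active_learning(pool, initial_idx=[0, 30], n_queries=8)
```

Because each query goes to the region the surrogate knows least about, the labeled set spreads across the pool instead of clustering near the initial points. A different acquisition rule (for example, one that also favors promising predicted values) would trade that coverage for faster convergence toward a target property.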


Source journal

Accounts of materials research

CiteScore

17.70

Self-citation rate

0.00%

Articles published