Machine Learning Assisted Material Discovery: A Small Data Approach
Qionghua Zhou*, Xinyu Chen and Jinlan Wang*
Accounts of Materials Research 6 (6), 685–694. Published 2025-05-22. DOI: 10.1021/accountsmr.1c00236
The data-driven paradigm, with machine learning as its most prominent example, is revolutionizing the way materials are discovered. The inductive nature of the data-driven approach makes prediction fast but also makes it heavily reliant on materials data. However, unlike text and images, where machine learning has succeeded on the back of big data, materials data tend to be small. Building large materials databases is a good remedy but not a permanent one: materials data cost far more to generate than text or images, and the materials databases available at this stage are far from sufficient in size. The shortage of materials data will persist for a long time to come, making small data approaches necessary for machine learning based materials discovery.
In this Account, we focus on small data strategies developed over the past few years and the scenarios in which they are used. In the first part, we discuss two general strategies, active learning and transfer learning, which add new data efficiently and make use of existing data, respectively. The key to active learning is the sampling strategy, which determines both the speed of convergence and the predictive range of the machine learning model. For transfer learning, adversarial training is introduced to extend the scope of the strategy, allowing knowledge to be transferred across materials and properties. We also discuss other small data approaches for special cases, such as material search with zero initial data and model training on multisource experimental data. In the second part, we focus on the construction of material descriptors and the reduction of their dimensionality. We have developed a crystal-graph-based descriptor specifically for two-dimensional materials. It encodes both structural and atomic information and has a flexible multilayer format that can be adapted to different target properties. Since the usable dimensionality of a material descriptor is limited by the amount of data, specially designed dimensionality reduction strategies are also discussed. In the third part, we discuss model interpretability. Several examples illustrate how model-based and data-based interpretation strategies can help us understand a machine learning model and its predictions.
The Account concludes with our perspectives on the latest developments in generative AI (in particular, large language models and diffusion models) and explainable AI, which could become powerful tools for machine learning assisted material discovery.
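The remark that the sampling strategy governs both the speed of convergence and the predictive range of an actively learned model can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and is not taken from the Account: the one-dimensional `oracle` stands in for an expensive calculation or experiment, the surrogate is a simple 1-nearest-neighbour predictor, the distance to the nearest labeled sample serves as a crude uncertainty proxy, and `kappa` is a hypothetical knob that trades exploitation against exploration.

```python
def oracle(x):
    """Hypothetical stand-in for an expensive calculation or experiment."""
    return (x - 0.3) ** 2


def predict(x, labeled):
    """1-nearest-neighbour surrogate (an illustrative assumption).

    Returns (predicted value, distance to the nearest labeled sample).
    The distance is used as a crude proxy for model uncertainty.
    """
    nearest_x, nearest_y = min(labeled, key=lambda p: abs(p[0] - x))
    return nearest_y, abs(nearest_x - x)


def active_learning(pool, labeled, n_queries, kappa):
    """Greedy loop: label the candidate with the lowest acquisition score.

    score = prediction - kappa * uncertainty (the target is minimised).
    kappa = 0 is pure exploitation; a large kappa favours exploration.
    """
    pool, labeled = list(pool), list(labeled)  # work on copies
    for _ in range(n_queries):
        def score(x):
            mean, uncertainty = predict(x, labeled)
            return mean - kappa * uncertainty

        x_next = min(pool, key=score)
        pool.remove(x_next)
        labeled.append((x_next, oracle(x_next)))  # the "expensive" step
    return labeled


def largest_gap(labeled):
    """Widest unsampled interval, a rough measure of predictive coverage."""
    xs = sorted(x for x, _ in labeled)
    return max(b - a for a, b in zip(xs, xs[1:]))


candidates = [i / 20 for i in range(1, 20)]          # unlabeled pool
initial = [(0.0, oracle(0.0)), (1.0, oracle(1.0))]   # two starting labels

exploit = active_learning(candidates, initial, n_queries=5, kappa=0.0)
explore = active_learning(candidates, initial, n_queries=5, kappa=10.0)

print("exploit: best y =", min(y for _, y in exploit), "gap =", largest_gap(exploit))
print("explore: best y =", min(y for _, y in explore), "gap =", largest_gap(explore))
```

In this sketch the exploitative run keeps sampling next to the best point found so far and leaves most of the interval unsampled, while the exploratory run spreads its queries across the whole range. A practical workflow would likely replace the nearest-neighbour surrogate with a model that provides calibrated uncertainties, such as a Gaussian process or an ensemble.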