A general framework to govern machine learning oriented materials data quality
Yue Liu, Zhengwei Yang, Xinxin Zou, Yuxiao Lin, Shuchang Ma, Wei Zuo, Zheyi Zou, Hong Wang, Maxim Avdeev, Siqi Shi
Materials Science and Engineering: R: Reports, volume 166, Article 101050. Published 2025-06-18. Journal impact factor: 31.6 (JCR Q1, Materials Science, Multidisciplinary).
DOI: 10.1016/j.mser.2025.101050
URL: https://www.sciencedirect.com/science/article/pii/S0927796X25001275
Citations: 0
Abstract
Machine learning (ML) is increasingly applied in materials discovery and property prediction, mainly because it offers low-cost and efficient data analysis. The quality of materials data can heavily influence the performance of ML models. However, most current approaches to improving data quality are purely data-driven, neglecting materials domain knowledge and the data quality issues latent throughout the ML modelling process. Here, we address the definition of high-quality data and propose a general framework for ML-oriented MATerials Data Quality Governance incorporating domain knowledge (MAT-DQG). The framework comprises nine dimensions defining WHAT aspects of materials data quality should be evaluated, lifecycle models guiding WHEN to execute data governance activities throughout the ML modelling process, and processing models guiding HOW to detect and address materials data quality issues. Sixty datasets from materials ML studies are assembled to demonstrate the potential utility and applications of MAT-DQG, including mining complicated structure-activity relationships in metals, inorganic non-metals, polymers, and composite materials. MAT-DQG identifies and resolves issues in 17 datasets, yielding prediction accuracy improvements of up to 49%. Our work lays a foundation for governing ML-oriented materials data and ensuring its reusability and reliability, which advances the frontiers of materials discovery and design.
About the journal:
Materials Science & Engineering R: Reports is a journal that covers a wide range of topics in the field of materials science and engineering. It publishes both experimental and theoretical research papers, providing background information and critical assessments on various topics. The journal aims to publish high-quality and novel research papers and reviews.
The subject areas covered by the journal include Materials Science (General), Electronic Materials, Optical Materials, and Magnetic Materials. In addition to regular issues, the journal also publishes special issues on key themes in the field of materials science, including Energy Materials, Materials for Health, Materials Discovery, Innovation for High Value Manufacturing, and Sustainable Materials development.