A general framework to govern machine learning oriented materials data quality

IF 31.6 | Zone 1: Materials Science | JCR Q1: Materials Science, Multidisciplinary
Yue Liu, Zhengwei Yang, Xinxin Zou, Yuxiao Lin, Shuchang Ma, Wei Zuo, Zheyi Zou, Hong Wang, Maxim Avdeev, Siqi Shi
DOI: 10.1016/j.mser.2025.101050
Journal: Materials Science and Engineering: R: Reports, vol. 166, Article 101050
Published: 2025-06-18
Publication type: Journal Article
URL: https://www.sciencedirect.com/science/article/pii/S0927796X25001275
Citations: 0

Abstract

Machine learning (ML) is increasingly applied in materials discovery and property prediction, mainly because it offers low-cost, efficient data analysis. The quality of materials data can heavily influence the performance of ML models. However, most current approaches to improving data quality are purely data-driven, neglecting materials domain knowledge and the data quality issues latent throughout the ML modelling process. Here, we address the definition of high-quality data and propose a general framework for ML-oriented MATerials Data Quality Governance incorporating domain knowledge (MAT-DQG). The framework comprises nine dimensions defining WHAT aspects of materials data quality should be evaluated, lifecycle models guiding WHEN to execute data governance activities throughout the ML modelling process, and processing models guiding HOW to detect and address materials data quality issues. We assemble 60 datasets from materials ML studies to demonstrate the potential utility and applications of MAT-DQG, including mining complicated structure-activity relationships in metals, inorganic non-metals, polymers, and composite materials. MAT-DQG identifies and resolves issues in 17 datasets, yielding prediction accuracy improvements of up to 49%. Our work lays a foundation for governing ML-oriented materials data and ensuring its reusability and reliability, advancing the frontiers of materials discovery and design.
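To make the idea of pairing data-driven checks with domain knowledge concrete, the following is a minimal sketch and not the MAT-DQG implementation. The abstract does not name the nine dimensions or describe the processing models, so the records, the field names (`composition`, `hardness_hv`), the issue labels, and the three rules below are all assumptions chosen for illustration: one generic check (missing value) and two rules that rely on materials knowledge (atomic fractions should sum to 1, and a hardness value should be positive).

```python
import math

# Hypothetical records (not from the paper): alloy compositions given as
# atomic fractions, plus a measured Vickers hardness as the ML target.
records = [
    {"id": "a1", "composition": {"Fe": 0.7, "Ni": 0.3}, "hardness_hv": 210.0},
    {"id": "a2", "composition": {"Fe": 0.6, "Ni": 0.3}, "hardness_hv": 190.0},
    {"id": "a3", "composition": {"Al": 0.5, "Cu": 0.5}, "hardness_hv": None},
    {"id": "a4", "composition": {"Ti": 1.0}, "hardness_hv": -15.0},
]


def check_record(rec, tol=1e-6):
    """Return a list of issue labels for one record (labels are illustrative)."""
    issues = []

    # Generic, purely data-driven check: is the target value present?
    if rec["hardness_hv"] is None:
        issues.append("missing_value")
    # Domain rule (assumed): hardness must be strictly positive.
    elif rec["hardness_hv"] <= 0:
        issues.append("unphysical_value")

    # Domain rule (assumed): atomic fractions must sum to 1 within a tolerance.
    total = sum(rec["composition"].values())
    if not math.isclose(total, 1.0, abs_tol=tol):
        issues.append("composition_not_normalized")

    return issues


def audit(recs):
    """Map record id -> list of issues, keeping only records with problems."""
    report = {}
    for rec in recs:
        issues = check_record(rec)
        if issues:
            report[rec["id"]] = issues
    return report


report = audit(records)
for rec_id, issues in report.items():
    print(rec_id, issues)
```

A purely statistical screen would likely catch the missing value in `a3` but could pass `a2`, whose numbers look ordinary individually; only the composition rule reveals that its fractions sum to 0.9. That gap is the kind of issue the abstract attributes to approaches that neglect domain knowledge.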
Source journal
Materials Science and Engineering: R: Reports (Engineering & Technology, Materials Science: Multidisciplinary)
CiteScore: 60.50
Self-citation rate: 0.30%
Articles published: 19
Review time: 34 days
About the journal: Materials Science & Engineering R: Reports covers a wide range of topics in materials science and engineering. It publishes both experimental and theoretical research papers, providing background information and critical assessments on various topics. The journal aims to publish high-quality and novel research papers and reviews. Subject areas include Materials Science (General), Electronic Materials, Optical Materials, and Magnetic Materials. In addition to regular issues, the journal publishes special issues on key themes in materials science, including Energy Materials, Materials for Health, Materials Discovery, Innovation for High Value Manufacturing, and Sustainable Materials development.