Domain knowledge-assisted materials data anomaly detection towards constructing high-performance machine learning models

IF 8.4 | Region 1: Materials Science | JCR Q1: Chemistry, Physical
Yue Liu, Shuchang Ma, Zhengwei Yang, Duo Wu, Yali Zhao, Maxim Avdeev, Siqi Shi
Journal of Materiomics, published 2025-04-24 (Journal Article)
DOI: 10.1016/j.jmat.2025.101066 (https://doi.org/10.1016/j.jmat.2025.101066)
Citations: 0


Abstract

Machine learning (ML) is widely applied to accelerate materials design and discovery because of its strong capability for data analysis and information extraction. However, experimental and computational errors often introduce data anomalies that harm the performance of ML models. Most anomaly detection methods in current use are purely data-driven and have limited capability to learn the complicated factors in materials data. Here, we propose a domain knowledge-assisted data anomaly detection (DKA-DAD) workflow in which materials domain knowledge is encoded as symbolic rules. Three detection models evaluate, respectively, the correctness of individual descriptor values, the correlations between descriptors, and the similarity between samples, and one modification model is constructed for comprehensive governance. To validate the workflow's potential utility and applications, we construct 180 synthetic datasets by injecting noise into 60 structured materials datasets collected from materials ML studies. On the synthetic datasets, DKA-DAD achieves a 12% improvement in F1-score for anomaly detection over a purely data-driven approach, and ML models trained on materials datasets processed through DKA exhibit an average 9.6% improvement in R² for property prediction. Our work provides a data anomaly detection approach guided by materials domain knowledge, towards accelerating ML-based materials design and discovery.
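To make the idea of encoding domain knowledge as symbolic rules more concrete, the following is a minimal Python sketch, not the authors' implementation. The abstract does not say how DKA-DAD represents its rules, so everything here is an assumption made for illustration: the descriptors (`mass`, `volume`, `density`, `dopant_fraction`), the three rules, the 5% tolerance, and the noise scheme are all hypothetical. The sketch only touches the first two kinds of check named in the abstract (individual descriptor values and relations between descriptors); it does not model sample similarity or the modification step. It also mimics the evaluation setup in spirit: noise is injected into clean data and detection is scored with F1.

```python
import random
from typing import Callable, Dict, List, Set, Tuple

Sample = Dict[str, float]
Rule = Callable[[Sample], bool]  # True when the sample satisfies the rule


def density_consistent(s: Sample) -> bool:
    # Hypothetical cross-descriptor rule: density must equal mass / volume within 5%.
    return abs(s["mass"] / s["volume"] - s["density"]) <= 0.05 * abs(s["density"])


# Hypothetical symbolic rules; names and bounds are illustrative only.
RULES: Dict[str, Rule] = {
    "density_positive": lambda s: s["density"] > 0.0,
    "dopant_fraction_in_unit_interval": lambda s: 0.0 <= s["dopant_fraction"] <= 1.0,
    "density_matches_mass_over_volume": density_consistent,
}


def violated_rules(sample: Sample) -> List[str]:
    """Return the names of all rules that the sample breaks."""
    return [name for name, rule in RULES.items() if not rule(sample)]


def detect_anomalies(samples: List[Sample]) -> Set[int]:
    """Flag every sample index that breaks at least one rule."""
    return {i for i, s in enumerate(samples) if violated_rules(s)}


def make_clean_samples(n: int, rng: random.Random) -> List[Sample]:
    """Generate toy samples that satisfy every rule by construction."""
    samples = []
    for _ in range(n):
        mass, volume = rng.uniform(1.0, 10.0), rng.uniform(0.5, 5.0)
        samples.append({"mass": mass, "volume": volume, "density": mass / volume,
                        "dopant_fraction": rng.uniform(0.0, 0.3)})
    return samples


def inject_noise(samples: List[Sample], fraction: float,
                 rng: random.Random) -> Tuple[List[Sample], Set[int]]:
    """Corrupt a fraction of the samples; return noisy copies and the corrupted indices."""
    noisy = [dict(s) for s in samples]
    corrupted = set(rng.sample(range(len(noisy)), round(fraction * len(noisy))))
    for i in corrupted:
        if rng.random() < 0.5:
            noisy[i]["density"] *= rng.choice([0.1, 10.0])  # breaks the mass/volume relation
        else:
            noisy[i]["dopant_fraction"] += 1.5  # pushes the fraction outside [0, 1]
    return noisy, corrupted


def f1_score(predicted: Set[int], actual: Set[int]) -> float:
    """F1 of the flagged indices against the truly corrupted indices."""
    tp = len(predicted & actual)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(predicted), tp / len(actual)
    return 2 * precision * recall / (precision + recall)


rng = random.Random(0)
clean = make_clean_samples(200, rng)
noisy, truth = inject_noise(clean, 0.1, rng)
flagged = detect_anomalies(noisy)
print(f"flagged {len(flagged)} of {len(noisy)} samples, F1 = {f1_score(flagged, truth):.2f}")
```

In this toy setup the injected errors are built to violate the rules, so the detector recovers all of them; that says nothing about the 12% figure reported in the abstract. The point of the sketch is the design choice: a rule returns a named, human-readable reason for each flag, which a purely data-driven outlier score does not.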
Source journal
Journal of Materiomics (Materials Science: Metals and Alloys)
CiteScore: 14.30
Self-citation rate: 6.40%
Articles published: 331
Review time: 37 days
About the journal: The Journal of Materiomics is a peer-reviewed open-access journal that aims to serve as a forum for the continuous dissemination of research within the field of materials science. It particularly emphasizes systematic studies on the relationships between composition, processing, structure, property, and performance of advanced materials. The journal is supported by the Chinese Ceramic Society and is indexed in SCIE and Scopus. It is commonly referred to as J Materiomics.