Domain knowledge-assisted materials data anomaly detection towards constructing high-performance machine learning models
Yue Liu, Shuchang Ma, Zhengwei Yang, Duo Wu, Yali Zhao, Maxim Avdeev, Siqi Shi
Journal of Materiomics, published 2025-04-24, DOI: 10.1016/j.jmat.2025.101066 (https://doi.org/10.1016/j.jmat.2025.101066)
Abstract
Machine learning (ML) is widely applied to accelerate materials design and discovery because of its strong capability for data analysis and information extraction. However, experimental and computational errors often introduce data anomalies that harm the performance of ML models. Most anomaly detection methods in current use are purely data-driven and have limited ability to learn the complicated factors in materials data. Here, we propose a domain knowledge-assisted data anomaly detection (DKA-DAD) workflow in which materials domain knowledge is encoded as symbolic rules. Three detection models evaluate, respectively, the correctness of individual descriptor values, the correlation between descriptors, and the similarity between samples, and one modification model is constructed for comprehensive governance. To validate the workflow's potential utility and applications, we construct 180 synthetic datasets by injecting noise into 60 structured materials datasets collected from materials ML studies. On the synthetic datasets, DKA-DAD achieves a 12% improvement in anomaly detection F1-score over a purely data-driven approach, and ML models trained on materials datasets processed through DKA exhibit an average 9.6% improvement in R² for property prediction. Our work provides a data anomaly detection approach guided by materials domain knowledge, towards accelerating ML-based materials design and discovery.
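To make the evaluation idea in the abstract concrete, the following is a minimal sketch of one ingredient only: checking individual descriptor values against symbolic range rules, injecting noise into a clean dataset, and scoring detection with F1. It is not the DKA-DAD implementation. The descriptor names, the allowed ranges, the noise scheme, and every function name here (`violates_rules`, `inject_noise`, `f1_score`) are hypothetical and chosen purely for illustration.

```python
import random

# Hypothetical symbolic rules: descriptor name -> allowed (min, max) range.
# These ranges are illustrative assumptions, not rules taken from the paper.
RULES = {
    "electronegativity": (0.7, 4.0),  # rough bounds of the Pauling scale
    "atomic_fraction": (0.0, 1.0),    # a fraction must lie in [0, 1]
}


def violates_rules(sample, rules=RULES):
    """Return True if any descriptor value falls outside its allowed range."""
    return any(not (lo <= sample[name] <= hi) for name, (lo, hi) in rules.items())


def inject_noise(samples, fraction, rng):
    """Copy the samples, corrupt a fraction of them, and return (data, labels)."""
    noisy = [dict(s) for s in samples]
    n_bad = int(len(noisy) * fraction)
    bad_idx = set(rng.sample(range(len(noisy)), n_bad))
    for i in bad_idx:
        # Assumed noise scheme: shift the value so it leaves the valid range.
        noisy[i]["atomic_fraction"] += 1.5
    labels = [i in bad_idx for i in range(len(noisy))]
    return noisy, labels


def f1_score(y_true, y_pred):
    """F1 = harmonic mean of precision and recall for boolean labels."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


rng = random.Random(0)
clean = [
    {
        "electronegativity": rng.uniform(0.8, 3.9),
        "atomic_fraction": rng.uniform(0.0, 1.0),
    }
    for _ in range(100)
]
noisy, labels = inject_noise(clean, fraction=0.1, rng=rng)
preds = [violates_rules(s) for s in noisy]
print("F1 on the toy dataset:", f1_score(labels, preds))
```

In this toy setup every injected error breaks a range rule by construction, so detection is trivially perfect; the score says nothing about the results reported above. The correlation-between-descriptors and similarity-between-samples checks named in the abstract are not sketched here.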
About the journal:
The Journal of Materiomics is a peer-reviewed open-access journal that aims to serve as a forum for the continuous dissemination of research within the field of materials science. It particularly emphasizes systematic studies on the relationships between composition, processing, structure, property, and performance of advanced materials. The journal is supported by the Chinese Ceramic Society and is indexed in SCIE and Scopus. It is commonly referred to as J Materiomics.