Imputation of Missing Data in Materials Science through Nearest Neighbors and Iterative Predictions

IF 5.5, Zone 1 (Chemistry), JCR Q2 (CHEMISTRY, PHYSICAL)

Journal of Chemical Theory and Computation, Pub Date: 2024-12-26, DOI: 10.1021/acs.jctc.4c01237

Chunhui Xie, Rui Li, Yunqi Li*, Haibo Xie, and Qibin Liu

Volume 21, Issue 1, Pages 70–78

Citations: 0

Abstract



Missing data in tabular data sets are ubiquitous in statistical analysis, big data analysis, and machine learning studies. Many strategies have been proposed to impute missing data, but their reliability has not been stringently assessed in materials science. Here, we carried out a benchmark test of six imputation strategies (Mean, MissForest, HyperImpute, Gain, Sinkhorn, and the newly proposed MatImpute) on seven representative data sets in materials science. The imputation-induced errors (IIEs) were evaluated as the difference between imputed and original values using root mean square error (RMSE), Wasserstein distance (WD), and a newly introduced metric, data set correlation convergence (DCC), which measure the difference at three levels: individual data, column-wise distribution, and correlation stability of a data set. MatImpute outperformed the others, with the lowest RMSE and WD and the highest DCC. The IIE increases with the data missing ratio and, considering inherent correlations among missing data, in the order missing at random < missing completely at random ≤ missing not at random. A similar increase of IIE was observed along the central departure distance, measured in units of the standard deviation, which is consistent with the increasing difficulty from interpolation to extrapolation. In further tests of IIE in regression and classification machine learning predictive models, MatImpute also preserved the highest data recovery fidelity. We released the code of MatImpute to facilitate the construction of high-quality data sets in materials science.
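To make the evaluation described in the abstract concrete, the following is a minimal sketch of how RMSE and a column-wise Wasserstein distance might be computed for two simple imputers under a missing-completely-at-random mask. It is not MatImpute and does not reproduce the paper's benchmark: the synthetic data, the plain k-nearest-neighbor imputer, and every function name (`mcar_mask`, `mean_impute`, `knn_impute`, `rmse`, `wasserstein_1d`) are assumptions chosen for illustration, and DCC is omitted because its definition is not given here.

```python
import numpy as np


def mcar_mask(shape, ratio, rng):
    """Missing completely at random: each cell is dropped independently."""
    return rng.random(shape) < ratio


def mean_impute(X_missing):
    """Fill each NaN with the mean of the observed values in its column."""
    col_means = np.nanmean(X_missing, axis=0)
    return np.where(np.isnan(X_missing), col_means, X_missing)


def knn_impute(X_missing, k=5):
    """Plain k-nearest-neighbor imputation (illustrative, not MatImpute).

    Distances use only the columns observed in both rows; each missing
    cell is filled with the mean of that column over the k nearest rows
    that have the column observed.
    """
    missing = np.isnan(X_missing)
    out = X_missing.copy()
    for i in np.where(missing.any(axis=1))[0]:
        obs = ~missing[i]
        diff = X_missing[:, obs] - X_missing[i, obs]
        valid = ~np.isnan(diff)
        sq = np.where(valid, diff ** 2, 0.0).sum(axis=1)
        cnt = valid.sum(axis=1)
        # Rows sharing no observed column with row i are pushed to infinity.
        d = np.where(cnt > 0, np.sqrt(sq / np.maximum(cnt, 1)), np.inf)
        for j in np.where(missing[i])[0]:
            cand = np.where(~missing[:, j])[0]
            nearest = cand[np.argsort(d[cand])[:k]]
            out[i, j] = X_missing[nearest, j].mean()
    return out


def rmse(X_true, X_imputed, mask):
    """Root mean square error over the cells that were masked out."""
    diff = X_true[mask] - X_imputed[mask]
    return float(np.sqrt(np.mean(diff ** 2)))


def wasserstein_1d(a, b):
    """1-Wasserstein distance between two equal-size empirical samples."""
    return float(np.mean(np.abs(np.sort(a) - np.sort(b))))


def mean_column_wd(X_true, X_imputed):
    """Average the column-wise Wasserstein distance over all columns."""
    n_cols = X_true.shape[1]
    return float(np.mean([wasserstein_1d(X_true[:, j], X_imputed[:, j])
                          for j in range(n_cols)]))


# Synthetic, strongly correlated table (an assumption, not a paper data set).
rng = np.random.default_rng(0)
z = rng.normal(size=300)
loadings = np.array([1.0, -0.8, 0.6, 1.2])
X = z[:, None] * loadings + 0.1 * rng.normal(size=(300, 4))

mask = mcar_mask(X.shape, ratio=0.2, rng=rng)
X_missing = np.where(mask, np.nan, X)

X_mean = mean_impute(X_missing)
X_knn = knn_impute(X_missing, k=5)

for name, X_imp in [("mean", X_mean), ("knn", X_knn)]:
    print(name, "RMSE:", rmse(X, X_imp, mask),
          "column-wise WD:", mean_column_wd(X, X_imp))
```

RMSE is computed only on the masked cells, since the observed cells are left untouched by both imputers, while the Wasserstein distance compares whole columns so that it reflects how far imputation shifts each column's distribution. Swapping `mcar_mask` for a mask that depends on observed or unobserved values would be one way to probe the other missingness mechanisms mentioned above.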


Source journal

Journal of Chemical Theory and Computation (Chemistry; Physics: Atomic, Molecular and Chemical)

CiteScore: 9.90

Self-citation rate: 16.40%

Articles published: 568

Review time: 1 month

About the journal: The Journal of Chemical Theory and Computation invites new and original contributions with the understanding that, if accepted, they will not be published elsewhere. Papers reporting new theories, methodology, and/or important applications in quantum electronic structure, molecular dynamics, and statistical mechanics are appropriate for submission to this Journal. Specific topics include advances in or applications of ab initio quantum mechanics, density functional theory, design and properties of new materials, surface science, Monte Carlo simulations, solvation models, QM/MM calculations, biomolecular structure prediction, and molecular dynamics in the broadest sense including gas-phase dynamics, ab initio dynamics, biomolecular dynamics, and protein folding. The Journal does not consider papers that are straightforward applications of known methods including DFT and molecular dynamics. The Journal favors submissions that include advances in theory or methodology with applications to compelling problems.