Analysis of Machine Learning Prediction Reliability based on Sampling Distance Evaluation with Feature Decorrelation
Authors: Evan Askanazi, Ilya Grinberg
Journal: Machine Learning: Science and Technology
Published: 2024-04-23
DOI: https://doi.org/10.1088/2632-2153/ad4231
Abstract
Machine learning (ML) methods are used successfully for data analysis and prediction in a wide variety of disciplines, but the reliability of their predictions is poorly understood because ML models lack transparency and behave as black boxes. In materials science and other fields, typical ML model results include a significant number of low-quality predictions. This problem is known to be particularly acute for target systems that differ significantly from the data used for ML model training. However, to date, a general method for uncertainty quantification (UQ) of ML predictions has not been available. Focusing on the intuitive and computationally efficient similarity-based UQ, we show that a simple metric based on Euclidean feature-space distance and sampling density, together with decorrelation of the features using Gram-Schmidt orthogonalization, allows effective separation of accurately predicted data points from data points with poor prediction accuracy. To demonstrate the generality of the method, we apply it to support vector regression models for various small data sets from materials science and other fields. We also show that the proposed metric is a more effective UQ tool than the standard approach of using the average distance to the k nearest neighbors (k = 1 to 10) in feature space for similarity evaluation. Our method is computationally simple, can be used with any ML method, and enables analysis of the sources of ML prediction errors. Therefore, it is suitable for use as a standard technique for estimating ML prediction reliability for small data sets and as a tool for data set design.
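The abstract names two ingredients that are easy to illustrate: decorrelating features with Gram-Schmidt orthogonalization, and the baseline similarity measure given by the average Euclidean distance to the k nearest training points. The following is a minimal sketch of those two ingredients only. It does not reproduce the proposed distance and sampling-density metric, whose definition the abstract does not give. The function names, the centering step, the scaling of each orthogonalized feature to unit length, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np


def gram_schmidt_fit(X):
    """Modified Gram-Schmidt on the columns (features) of X.

    Returns the column means and an upper-triangular R such that
    (X - mean) = Q @ R with orthonormal columns in Q.
    Assumes the centered features are linearly independent.
    """
    mean = X.mean(axis=0)
    Q = (X - mean).astype(float)
    n_features = Q.shape[1]
    R = np.zeros((n_features, n_features))
    for j in range(n_features):
        R[j, j] = np.linalg.norm(Q[:, j])
        Q[:, j] /= R[j, j]
        # Remove the component along feature j from all later features.
        for k in range(j + 1, n_features):
            R[j, k] = Q[:, j] @ Q[:, k]
            Q[:, k] -= R[j, k] * Q[:, j]
    return mean, R


def gram_schmidt_transform(X, mean, R):
    """Map points into the decorrelated space: solve Z @ R = (X - mean)."""
    return np.linalg.solve(R.T, (X - mean).T).T


def mean_knn_distance(X_train, X_query, k=5):
    """Baseline score: mean Euclidean distance to the k nearest training points."""
    diffs = X_query[:, None, :] - X_train[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    nearest = np.sort(dists, axis=1)[:, :k]
    return nearest.mean(axis=1)


# Synthetic training set (assumption): features 0 and 1 are strongly correlated.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
X_train = np.hstack([
    base,
    0.9 * base + 0.1 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 1)),
])

mean, R = gram_schmidt_fit(X_train)
Z_train = gram_schmidt_transform(X_train, mean, R)

# One query at the center of the training data, one far outside it.
X_query = np.vstack([X_train.mean(axis=0), X_train.mean(axis=0) + 10.0])
Z_query = gram_schmidt_transform(X_query, mean, R)

scores = mean_knn_distance(Z_train, Z_query, k=5)
print("mean k-NN distance (central, distant):", scores)
```

In this sketch a larger score means the query point lies farther from the sampled region of the decorrelated feature space, which is the intuition behind similarity-based UQ. Whether the paper rescales the orthogonalized features in the same way is not stated in the abstract.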