What Is the Outlier—Consistent Outlier or Inconsistent Outlier?

Impact factor: 4.1 | JCR: Q2, Chemistry, Analytical

Analytical science advances | Pub Date: 2025-07-24 | DOI: 10.1002/ansa.70030

Hiromasa Kaneko


Citations: 0

Abstract

In the design of molecules, materials and processes, datasets used for machine learning or regression analysis can contain outliers (outlier samples). Outlier samples with high prediction errors in regression analysis have conventionally been divided into bad leverage points and vertical outliers (good leverage points have low prediction errors). This study instead classifies outlier samples into consistent outliers (CO) and inconsistent outliers (ICO), to support a more detailed discussion of outlier samples and their effective utilisation. For a CO, the relationship between the explanatory variables (x) and the dependent variables (y) is consistent with that of the other samples; for an ICO, it is not. An index of ICO-likeness based on triple cross-validation and the mean absolute error is proposed, and a method to determine whether an outlier sample is an ICO or a CO is developed. Data analysis using numerical simulation datasets and a compound dataset with boiling points confirms that the proposed method can appropriately discriminate between ICO and CO. When an outlier sample is determined to be an ICO, x and y of that sample should first be checked for errors. If no errors are found in x and y, a new x should be added to explain the y of the ICO. When an outlier sample is determined to be a CO, extrapolating from the CO in x, using a model that includes the CO, is expected to further improve the y values.
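The abstract does not give the formula for the ICO-likeness index, nor does it define "triple cross-validation", so the following is only a toy illustration of the CO/ICO distinction and not the paper's method. It assumes a quadratic least-squares model, plain 3-fold cross-validation, and a hypothetical score (`inconsistency_ratio`): the cross-validated mean absolute error of the other samples when the candidate outlier is added to every training fold, divided by the same error without it. All function names and data here are invented for the example.

```python
import numpy as np

def design(x):
    # Quadratic design matrix [1, x, x^2]; the model form is an assumption of this toy.
    return np.column_stack([np.ones_like(x), x, x**2])

def cv_mae_of_rest(x, y, x_extra=None, y_extra=None, n_splits=3, seed=0):
    """MAE over n_splits-fold CV on (x, y). The optional extra sample is added
    to every training fold but is never scored itself."""
    idx = np.random.default_rng(seed).permutation(len(y))
    abs_err = []
    for test in np.array_split(idx, n_splits):
        train = np.setdiff1d(idx, test)
        x_tr, y_tr = x[train], y[train]
        if x_extra is not None:
            x_tr = np.append(x_tr, x_extra)
            y_tr = np.append(y_tr, y_extra)
        coef, *_ = np.linalg.lstsq(design(x_tr), y_tr, rcond=None)
        abs_err.append(np.abs(design(x[test]) @ coef - y[test]))
    return float(np.mean(np.concatenate(abs_err)))

def inconsistency_ratio(x, y, i):
    """Hypothetical score: how much adding sample i to training inflates the
    CV error of the remaining samples (about 1 means no harm)."""
    rest = np.arange(len(y)) != i
    base = cv_mae_of_rest(x[rest], y[rest])
    with_i = cv_mae_of_rest(x[rest], y[rest], x[i], y[i])
    return with_i / base

# Simulated bulk data: y = x^2 + noise, with x confined to [-1, 1].
rng = np.random.default_rng(42)
x_bulk = rng.uniform(-1.0, 1.0, 30)
y_bulk = x_bulk**2 + rng.normal(0.0, 0.1, 30)

# Candidate far outside the bulk in x that follows y = x^2 (CO-like) ...
x_co, y_co = np.append(x_bulk, 5.0), np.append(y_bulk, 25.0)
# ... and one at the same x that contradicts the relationship (ICO-like).
x_ico, y_ico = np.append(x_bulk, 5.0), np.append(y_bulk, -25.0)

ratio_co = inconsistency_ratio(x_co, y_co, 30)
ratio_ico = inconsistency_ratio(x_ico, y_ico, 30)
print(f"CO-like candidate:  ratio = {ratio_co:.2f}")
print(f"ICO-like candidate: ratio = {ratio_ico:.2f}")
```

In this sketch, the CO-like sample leaves the error on the other samples essentially unchanged, because its x-y relationship agrees with theirs, whereas the ICO-like sample drags the fit away from the bulk and inflates that error. A real analysis would follow the index as defined in the paper itself.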


Source journal

Analytical science advances

CiteScore

4.60

Self-citation rate

0.00%

Articles published