Non-removal strategy for outliers in predictive models: The PAELLA algorithm case

Log. J. IGPL · Pub Date: 2019-12-09 · DOI: 10.1093/jigpal/jzz052

M. C. Limas, H. Alaiz-Moretón, Laura Fernández-Robles, Javier Alfonso-Cendón, C. F. Llamas, Lidia Sánchez-González, H. Pérez

Citations: 0

Abstract

Non-removal strategy for outliers in predictive models: The PAELLA algorithm case

This paper reports the experience of using the PAELLA algorithm as a helper tool in robust regression rather than for outlier identification and removal, its original purpose. This novel usage exploits the occurrence vector computed by the algorithm to strengthen the effect of the more reliable samples and lessen the impact of those that would otherwise be considered outliers. To that end, a series of experiments is conducted to learn how best to use the information contained in the occurrence vector. In the first experiment, a reference predictive model is fit to the whole raw data set, a deliberately difficult artificial data set. The second experiment reports the results of fitting a similar predictive model after discarding the samples marked as outliers by PAELLA. The third experiment uses the occurrence vector provided by PAELLA to classify the observations into multiple bins and fits every possible model, varying which bins are used for fitting and which are discarded. The fourth experiment introduces a sampling process before fitting, in which the occurrence vector represents the likelihood of an observation being included in the training data set. The fifth experiment treats the sampling process as an internal step interleaved between training epochs. The last experiment compares our approach, using weighted neural networks, with a state-of-the-art method.
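The third and fourth experiments can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the equal-width binning, the sampling with replacement, and the reading of a higher occurrence score as "more reliable" are all assumptions made here for illustration, since the abstract does not specify them.

```python
import random
from itertools import combinations


def assign_bins(occurrence, n_bins):
    """Map each occurrence score to an equal-width bin index in [0, n_bins).

    Equal-width binning is an assumption; the abstract only says that
    observations are classified into multiple bins.
    """
    lo, hi = min(occurrence), max(occurrence)
    width = (hi - lo) / n_bins or 1.0  # avoid a zero width when all scores are equal
    return [min(int((o - lo) / width), n_bins - 1) for o in occurrence]


def bin_subsets(n_bins):
    """Yield every non-empty combination of bins to keep for one candidate fit."""
    for k in range(1, n_bins + 1):
        yield from combinations(range(n_bins), k)


def sample_training_indices(occurrence, size, seed=0):
    """Draw training indices with probability proportional to occurrence.

    Sampling with replacement is an assumption. A score of 0 means the
    observation is never drawn.
    """
    rng = random.Random(seed)
    return rng.choices(range(len(occurrence)), weights=occurrence, k=size)
```

Under these assumptions, each tuple yielded by `bin_subsets` selects the observations used for one candidate model, while `sample_training_indices` could be called once before fitting (fourth experiment) or once per epoch with a different seed (fifth experiment).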
