Identifying Multiple Outliers in Multivariate Data

Journal of the Royal Statistical Society: Series B (Methodological). Pub Date: 1992-07-01. DOI: 10.1111/J.2517-6161.1992.TB01449.X

A. Hadi

Pages: 761-771

Citations: 816

Abstract

SUMMARY We propose a procedure for the detection of multiple outliers in multivariate data. Let X be an n × p data matrix representing n observations on p variates. We first order the n observations, using an appropriately chosen robust measure of outlyingness, then divide the data set into two initial subsets: a 'basic' subset which contains p + 1 'good' observations and a 'non-basic' subset which contains the remaining n - p - 1 observations. Second, we compute the relative distance from each point in the data set to the centre of the basic subset, relative to the (possibly singular) covariance matrix of the basic subset. Third, we rearrange the n observations in ascending order accordingly, then divide the data set into two subsets: a basic subset which contains the first p + 2 observations and a non-basic subset which contains the remaining n - p - 2 observations. This process is repeated until an appropriately chosen stopping criterion is met. The final non-basic subset of observations is declared an outlying subset. The procedure proposed is illustrated and compared with existing methods by using several data sets. The procedure is simple, computationally inexpensive, suitable for automation, computable with widely available software packages, effective in dealing with masking and swamping problems and, most importantly, successful in identifying multivariate outliers.
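The iterative procedure described in the summary can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions, not the paper's exact algorithm: the summary does not name the robust measure of outlyingness or the stopping criterion, so the sketch assumes an initial ordering by distance about the coordinate-wise median, and a stopping rule that fires once the basic subset holds at least (n + p + 1) // 2 points and the next candidate's squared distance exceeds a caller-supplied `cutoff`. The function names, `cutoff`, and `min_size` are hypothetical. The pseudo-inverse stands in for the handling of a possibly singular covariance matrix.

```python
import numpy as np


def mahalanobis_sq(X, center, cov):
    """Squared distance of each row of X from center, relative to cov.

    The pseudo-inverse is used so that a singular cov does not raise.
    """
    diff = X - center
    cov_pinv = np.linalg.pinv(np.atleast_2d(cov))
    return np.einsum("ij,jk,ik->i", diff, cov_pinv, diff)


def forward_search_outliers(X, cutoff, min_size=None):
    """Grow a 'basic' subset one observation at a time.

    Returns the sorted row indices of the final non-basic subset.
    cutoff and min_size are assumptions of this sketch, not values
    taken from the paper's summary.
    """
    n, p = X.shape
    if min_size is None:
        min_size = (n + p + 1) // 2  # assumed: never stop before about half the data

    # Step 1 (assumed robust measure): distances about the coordinate-wise median.
    d = mahalanobis_sq(X, np.median(X, axis=0), np.cov(X, rowvar=False))
    order = np.argsort(d)
    size = p + 1

    while size < n:
        basic = order[:size]
        # Step 2: distances to the centre of the basic subset, relative to its covariance.
        center = X[basic].mean(axis=0)
        cov = np.cov(X[basic], rowvar=False)
        d = mahalanobis_sq(X, center, cov)
        # Step 3: re-order all n observations by the new distances.
        order = np.argsort(d)
        # Assumed stopping rule: the next candidate is too far from the basic subset.
        if size >= min_size and d[order[size]] > cutoff:
            break
        size += 1

    return np.sort(order[size:])
```

Calling `forward_search_outliers(X, cutoff=...)` on an n × p array returns the indices flagged as outlying, or an empty array if the basic subset grows to include every observation. The distances here are unscaled, so a practical threshold would need to be calibrated to the data and to the size of the basic subset.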

