Identifying Multiple Outliers in Multivariate Data

A. Hadi
Journal of the Royal Statistical Society, Series B (Methodological). Published 1992-07-01. DOI: https://doi.org/10.1111/J.2517-6161.1992.TB01449.X
Citations: 816

Abstract

We propose a procedure for the detection of multiple outliers in multivariate data. Let X be an n × p data matrix representing n observations on p variates. First, we order the n observations using an appropriately chosen robust measure of outlyingness, then divide the data set into two initial subsets: a 'basic' subset, which contains p + 1 'good' observations, and a 'non-basic' subset, which contains the remaining n - p - 1 observations. Second, we compute the distance from each point in the data set to the centre of the basic subset, relative to the (possibly singular) covariance matrix of the basic subset. Third, we rearrange the n observations in ascending order of these distances, then divide the data set into two subsets: a basic subset, which contains the first p + 2 observations, and a non-basic subset, which contains the remaining n - p - 2 observations. This process is repeated until an appropriately chosen stopping criterion is met. The final non-basic subset of observations is declared an outlying subset. The proposed procedure is illustrated and compared with existing methods by using several data sets. The procedure is simple, computationally inexpensive, suitable for automation, computable with widely available software packages, effective in dealing with masking and swamping problems and, most importantly, successful in identifying multivariate outliers.
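The three steps above can be sketched in Python with NumPy. This is only an illustration of the growing-basic-subset idea under stated assumptions, not the paper's implementation: the abstract does not name the robust measure, the treatment of a singular covariance matrix, or the stopping criterion. The sketch assumes a median/MAD-scaled distance for the initial ordering, a pseudo-inverse for the covariance matrix, and a user-supplied `cutoff` on the squared distance that is only checked once the basic subset holds about half the data. The function name `forward_outlier_search` and the variable `check_from` are hypothetical.

```python
import numpy as np


def forward_outlier_search(X, cutoff):
    """Return sorted row indices of the final non-basic (outlying) subset."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape

    # Step 1 (assumed measure): squared distance from the coordinatewise
    # median, each coordinate scaled by its median absolute deviation.
    med = np.median(X, axis=0)
    mad = np.median(np.abs(X - med), axis=0)
    mad[mad == 0] = 1.0  # guard against constant columns
    d2 = (((X - med) / mad) ** 2).sum(axis=1)
    order = np.argsort(d2, kind="stable")

    size = p + 1                   # initial basic subset size, as in the abstract
    check_from = (n + p + 1) // 2  # assumption: stopping rule is tested from here on

    while size < n:
        basic = order[:size]
        centre = X[basic].mean(axis=0)
        cov = np.cov(X[basic], rowvar=False).reshape(p, p)
        cov_inv = np.linalg.pinv(cov)  # assumption: pseudo-inverse if singular

        # Step 2: squared distance of every point to the basic-subset centre.
        diff = X - centre
        d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)

        # Step 3: re-order all observations by the new distances.
        order = np.argsort(d2, kind="stable")

        # Assumed stopping rule: stop when the nearest non-basic point
        # lies beyond the cutoff; everything non-basic is then flagged.
        if size >= check_from and d2[order[size]] > cutoff:
            return sorted(int(i) for i in order[size:])
        size += 1  # otherwise grow the basic subset by one observation

    return []  # every observation was absorbed into the basic subset
```

Calling `forward_outlier_search(X, cutoff)` on an n × p array returns the row indices left in the non-basic subset when the loop stops, or an empty list if no point exceeds the cutoff. A chi-square quantile with p degrees of freedom is one common choice for a cutoff on squared Mahalanobis-type distances, but any such choice here is an assumption rather than the criterion used in the paper. The pseudo-inverse ignores directions in which the basic subset has no spread, so a production implementation would likely need a more careful treatment of the singular case.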