Cellwise outlier detection in heterogeneous populations

Giorgia Zaccaria, Luis A. García-Escudero, Francesca Greselin, Agustín Mayo-Íscar
arXiv - STAT - Methodology, published 2024-09-12. DOI: arxiv-2409.07881 (https://doi.org/arxiv-2409.07881)

Abstract

Real-world applications may be affected by outlying values. In the model-based clustering literature, several methodologies have been proposed to detect units that deviate from the majority of the data (rowwise outliers) and trim them from the parameter estimates. However, the discarded observations may still carry valuable information in some of their observed features. Following the more recent cellwise contamination paradigm, we introduce a Gaussian mixture model for cellwise outlier detection. The model is estimated via an Expectation-Maximization (EM) algorithm with an additional step that flags the contaminated cells of a data matrix and then imputes them, rather than discarding them, before the parameter estimation. This procedure adheres to the spirit of the EM algorithm by treating the contaminated cells as missing values. We analyze the performance of the proposed model in comparison with other existing methodologies through a simulation study with different scenarios and illustrate its potential use for clustering, outlier detection, and imputation on three real data sets.
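The idea of treating contaminated cells as missing values and imputing them can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes a single Gaussian component with known mean `mu` and covariance `Sigma`, and it takes the flags as given rather than estimating them. The function name `impute_flagged_cells` and all of its parameters are hypothetical, chosen only for illustration. Each flagged cell is replaced by its conditional mean given the unflagged cells of the same row, which is the standard Gaussian conditional expectation.

```python
import numpy as np

def impute_flagged_cells(X, flagged, mu, Sigma):
    """Replace flagged cells by their Gaussian conditional mean.

    Illustrative only: one Gaussian component, parameters assumed known,
    flags assumed given (hypothetical helper, not from the paper).
    """
    X_imp = np.array(X, dtype=float, copy=True)
    flagged = np.asarray(flagged, dtype=bool)
    mu = np.asarray(mu, dtype=float)
    Sigma = np.asarray(Sigma, dtype=float)

    for i in range(X_imp.shape[0]):
        m = flagged[i]   # cells treated as missing in row i
        o = ~m           # cells kept as observed in row i
        if not m.any():
            continue     # nothing flagged in this row
        if not o.any():
            X_imp[i, m] = mu[m]  # no observed cells: fall back to the mean
            continue
        # E[x_m | x_o] = mu_m + Sigma_mo Sigma_oo^{-1} (x_o - mu_o)
        S_oo = Sigma[np.ix_(o, o)]
        S_mo = Sigma[np.ix_(m, o)]
        X_imp[i, m] = mu[m] + S_mo @ np.linalg.solve(S_oo, X_imp[i, o] - mu[o])
    return X_imp

# Usage: the second cell of the first row is flagged and gets imputed.
X = np.array([[2.0, 100.0], [1.0, 1.0]])
flagged = np.array([[False, True], [False, False]])
X_imp = impute_flagged_cells(X, flagged, mu=[0.0, 0.0],
                             Sigma=[[1.0, 0.5], [0.5, 1.0]])
```

In a mixture setting, an imputation of this kind would be computed per component and combined using posterior membership weights; the sketch omits that step to keep the conditional-mean formula in focus.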