Robust Estimation of the Covariance Matrix From Data With Outliers

IF 2.7 Q2 ENGINEERING, ELECTRICAL & ELECTRONIC

IEEE Open Journal of Signal Processing · Pub Date: 2024-10-02 · DOI: 10.1109/OJSP.2024.3473610

Petre Stoica; Prabhu Babu; Piyush Varshney

Volume 5, pages 1061-1072. Full text: https://ieeexplore.ieee.org/document/10704043/

Citations: 0

Abstract

The robust estimation of the covariance matrix is a frequent task in practical applications in which, more often than not, some data samples are outliers. There are several methods that can be used to robustly estimate a covariance matrix from corrupted data, a representative example of which is the minimum covariance determinant (MCD) method. In this paper we present a maximum conditional likelihood interpretation of MCD that provides a new motivation for, as well as further insights into, this method. To perform at its best, MCD requires information on the number of outliers in the data, which is usually not available. We propose two new methods for covariance matrix estimation from data with outliers that do not suffer from this problem: TEST (multiple-hypothesis testing method), which uses the FDR (false discovery rate) to test a set of model hypotheses and hence estimate the number of outliers and their locations, and LIKE (penalized likelihood method), which solves the outlier estimation problem using a GIC (generalized information criterion) to penalize the complexity of a high-dimensional data model. We show by means of numerical simulations that the performances of TEST and LIKE are relatively similar to one another as well as to the performance of the oracle MCD (which uses the true number of outliers), and significantly better than the performance of MCD that uses an upper bound on the outlier number.
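To make the dependence of MCD on the assumed number of outliers concrete, here is a minimal Python sketch. It is not the authors' TEST or LIKE method, and it is not an exact MCD solver: it uses the concentration-step heuristic popularized by FAST-MCD, restarted from a few random subsets. The function name `mcd_c_steps`, its parameters, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np


def mcd_c_steps(X, n_outliers, n_starts=10, max_iter=100, seed=0):
    """Approximate MCD estimate via concentration steps (illustrative helper).

    X          : (n, p) data matrix, one sample per row
    n_outliers : assumed number of outliers; n - n_outliers samples are retained
    Returns (mean, covariance, sorted indices of the retained subset).
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    h = n - n_outliers  # size of the subset treated as clean
    best = (np.inf, None, None, None)
    for _ in range(n_starts):
        # Start from a small random subset of p + 1 samples.
        idx = np.sort(rng.choice(n, size=p + 1, replace=False))
        for _ in range(max_iter):
            mu = X[idx].mean(axis=0)
            S = np.atleast_2d(np.cov(X[idx], rowvar=False))
            diff = X - mu
            # Squared Mahalanobis distance of every sample to the current fit.
            d2 = np.einsum("ij,jk,ik->i", diff, np.linalg.pinv(S), diff)
            # Concentration step: keep the h samples closest to the current fit.
            new_idx = np.sort(np.argsort(d2)[:h])
            if np.array_equal(new_idx, idx):
                break
            idx = new_idx
        mu = X[idx].mean(axis=0)
        S = np.atleast_2d(np.cov(X[idx], rowvar=False))
        det = np.linalg.det(S)
        # Keep the restart whose subset covariance has the smallest determinant.
        if det < best[0]:
            best = (det, mu, S, idx)
    return best[1], best[2], best[3]


# Synthetic data (assumed for illustration): 200 samples in 2-D, identity covariance.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 2))
X[:20] += 10.0  # shift the first 20 samples far away so they act as outliers

S_sample = np.cov(X, rowvar=False)  # ordinary sample covariance, inflated by outliers
mu_mcd, S_mcd, support = mcd_c_steps(X, n_outliers=20)  # "oracle" outlier count
```

Here `n_outliers=20` plays the role of the oracle knowledge mentioned in the abstract. In practice that count is unknown, which is the gap the paper's TEST and LIKE methods are meant to fill.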


