Outlier Detection Technique for Heterogeneous Data Using Trimmed-Mean Robust Estimators

IF 0.3, JCR Q4 (Computer Science, Hardware & Architecture)

Radio Electronics Computer Science Control. Pub Date: 2022-10-16. DOI: 10.15588/1607-3274-2022-3-5

A. Shved, Yevhen Davydenko


Citations: 0

Abstract


OUTLIER DETECTION TECHNIQUE FOR HETEROGENEOUS DATA USING TRIMMED-MEAN ROBUST ESTIMATORS

Context. Unfortunately, the assumptions most commonly used in parametric statistics, such as normality, linearity, and independence, are not always fulfilled in practice. The main reason is the appearance in data samples of observations that differ from the bulk of the data, which makes the sample heterogeneous. Under such conditions, applying generally accepted estimation procedures, for example the sample mean, increases the bias and reduces the efficiency of the resulting estimates. This raises the problem of how to process data sets that include outliers, especially small samples. The object of the study is the process of detecting and excluding anomalous objects from heterogeneous data sets.

Objective. The goal of the work is to develop a procedure for anomaly detection in heterogeneous data sets and to justify the use of a number of trimmed-mean robust estimators as a statistical measure of the location parameter of distorted parametric distribution models.

Method. The problems of analyzing (processing) heterogeneous data that contain outliers, that is, sharply distinguished, suspicious observations, are considered. The possibilities of using robust estimation methods to process heterogeneous data are analyzed. A procedure is proposed for identifying and extracting outliers caused by measurement errors, hidden equipment defects, experimental conditions, etc. The approach rests on the methods of robust statistics: it applies symmetric and asymmetric truncation to the ranked set obtained from the initial sample of measurement data. Adaptive robust procedures are proposed for a reasoned choice of the truncation coefficient. Observations that fall into the zones of the smallest and largest order statistics are considered outliers.

Results. In contrast to traditional criteria for identifying outlying observations, such as the Smirnov (Grubbs) criterion and the Dixon criterion, the proposed approach splits the analyzed data set into a homogeneous component and a set of outlying observations, under the assumption that the share of outliers in the analyzed data is unknown.

Conclusions. The article proposes using robust statistics methods to form the supposed zones of homogeneous and outlying observations in the ranked set built from the initial sample of the analyzed data. A set of adaptive robust procedures is proposed to establish the expected truncation levels that form the zones of outlying observations in the regions of the smallest and largest order statistics of the ranked data set. The final truncation level of the ranked data set is refined using existing criteria that test the boundary observations (minimum and maximum) for outliers.
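The truncation step described in the Method paragraph can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `trim_ranked` and the parameters `alpha_low` and `alpha_high` are hypothetical, and the truncation coefficients are passed in as fixed values, whereas the abstract says they are chosen by adaptive robust procedures and then refined with boundary tests such as the Grubbs or Dixon criteria (neither step is shown here).

```python
from typing import List, Tuple


def trim_ranked(
    sample: List[float], alpha_low: float, alpha_high: float
) -> Tuple[List[float], List[float], float]:
    """Split a sample into a retained core and the trimmed tails.

    alpha_low and alpha_high are hypothetical truncation coefficients:
    the fractions of the ranked set cut from the bottom and from the top.
    Equal values give symmetric truncation, unequal values asymmetric.
    Returns (core, suspects, trimmed_mean).
    """
    if not sample:
        raise ValueError("sample must not be empty")
    if alpha_low < 0 or alpha_high < 0 or alpha_low + alpha_high >= 1:
        raise ValueError("need alpha_low, alpha_high >= 0 and their sum < 1")

    ranked = sorted(sample)        # the ranked set (order statistics)
    n = len(ranked)
    k_low = int(alpha_low * n)     # count of smallest order statistics cut
    k_high = int(alpha_high * n)   # count of largest order statistics cut

    core = ranked[k_low:n - k_high]                  # assumed homogeneous part
    suspects = ranked[:k_low] + ranked[n - k_high:]  # candidate outliers
    trimmed_mean = sum(core) / len(core)             # robust location estimate
    return core, suspects, trimmed_mean


if __name__ == "__main__":
    data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 25.0, 10.1]
    # Symmetric truncation: cut 10% of the ranked set from each end.
    print(trim_ranked(data, 0.1, 0.1))
    # Asymmetric truncation: cut only from the upper end.
    print(trim_ranked(data, 0.0, 0.1))
```

In this sketch the observations in the trimmed zones are only candidates. A symmetric cut also flags the smallest value even when it is unremarkable, which is one reason a refinement step on the boundary observations, as the Conclusions describe, is useful.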


Source journal

Radio Electronics Computer Science Control (Computer Science, Hardware & Architecture)

Self-citation rate

20.00%

Articles published

Review time

12 weeks