Estimating p-Values for Deviation Detection

2014 IEEE Eighth International Conference on Self-Adaptive and Self-Organizing Systems Workshops. Pub Date: 2014-09-08. DOI: 10.1109/SASO.2014.22

Thorsteinn S. Rögnvaldsson, Henrik Norrman, S. Byttner, E. Järpe


Citations: 19


Estimating p-Values for Deviation Detection

Deviation detection is important for self-monitoring systems. Performing it well requires methods that, given only "normal" data from a distribution of unknown parametric form, can produce a reliable statistic for rejecting the null hypothesis, i.e. evidence of deviating data. One measure of the strength of this evidence is the p-value, but few deviation detection methods use p-value estimation. We compare three methods that can be used to produce p-values: the one-class support vector machine (OCSVM), conformal anomaly detection (CAD), and a simple "most central pattern" (MCP) algorithm. The OCSVM and CAD methods should be able to handle a distribution of any shape. The methods are evaluated on synthetic data sets, to test and illustrate their strengths and weaknesses, and on data from a real-life self-monitoring scenario with a city bus fleet in normal traffic. The OCSVM uses a Gaussian kernel for the synthetic data and a Hellinger kernel for the empirical data. The MCP method uses the Mahalanobis metric for the synthetic data and the Hellinger metric for the empirical data. The CAD uses the same metrics as the MCP method, with a k-nearest neighbour (kNN) non-conformity measure for both data sets. The conclusion is that all three methods give reasonable, and quite similar, results on the real-life data set, but that they have clear strengths and weaknesses on the synthetic data sets. The MCP algorithm is quick and accurate when the "normal" data distribution is unimodal and symmetric (with the chosen metric), but not otherwise. The OCSVM is somewhat cumbersome to use for creating (quantized) p-values, but it is accurate and reliable when the data distribution is multimodal and asymmetric. The CAD is also accurate for multimodal and asymmetric distributions. The experiment on the vehicle data illustrates how algorithms like these can be used in a self-monitoring system that uses a fleet of vehicles to conduct deviation detection without supervision and without prior knowledge of what is being monitored.
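The contrast the abstract draws between MCP and CAD on multimodal data can be illustrated with a small sketch. The abstract does not spell out either algorithm, so everything below is an assumption made for illustration: the "most central pattern" is taken to be the normal sample with the smallest summed distance to all other samples, the CAD score is the summed distance to the k nearest normal samples (computed leave-one-out for the normal data, a simplification of full conformal prediction), the plain Euclidean metric stands in for the Mahalanobis and Hellinger metrics, and the function names `mcp_p_value` and `knn_p_value` are hypothetical.

```python
import numpy as np


def pairwise_dist(a, b):
    """Euclidean distances between every row of a and every row of b."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))


def mcp_p_value(x, normal):
    """Empirical p-value from distances to the most central sample.

    Assumption: the central pattern is the normal sample with the
    smallest summed distance to all other normal samples.
    """
    d = pairwise_dist(normal, normal)
    center = normal[d.sum(axis=1).argmin()]
    d_normal = np.sqrt(((normal - center) ** 2).sum(axis=1))
    d_x = np.sqrt(((x - center) ** 2).sum())
    # Share of normal samples at least as far from the center as x.
    return (np.sum(d_normal >= d_x) + 1) / (len(normal) + 1)


def knn_p_value(x, normal, k=3):
    """Empirical p-value from a kNN non-conformity score.

    Assumption: the score is the summed distance to the k nearest
    normal samples; normal samples are scored leave-one-out.
    """
    d = pairwise_dist(normal, normal)
    np.fill_diagonal(d, np.inf)  # a sample is not its own neighbour
    a_normal = np.sort(d, axis=1)[:, :k].sum(axis=1)
    d_x = pairwise_dist(x[None, :], normal)[0]
    a_x = np.sort(d_x)[:k].sum()
    # Share of normal samples at least as non-conforming as x.
    return (np.sum(a_normal >= a_x) + 1) / (len(normal) + 1)


# Synthetic bimodal "normal" data: two tight clusters far apart.
rng = np.random.default_rng(0)
normal = np.vstack([
    rng.normal(loc=(-5.0, 0.0), scale=0.5, size=(100, 2)),
    rng.normal(loc=(5.0, 0.0), scale=0.5, size=(100, 2)),
])

# A test point in the empty region between the two modes.
between = np.array([0.0, 0.0])
p_mcp = mcp_p_value(between, normal)
p_knn = knn_p_value(between, normal)
print(f"MCP p-value: {p_mcp:.3f}, kNN p-value: {p_knn:.3f}")
```

Under these assumptions the kNN score assigns the in-between point a very small p-value, because no normal sample lies near it. The MCP score does not: the central sample sits in one cluster, so roughly half of the normal data (the other cluster) is farther from it than the test point. This is consistent with the abstract's remark that MCP is accurate only for unimodal, symmetric distributions.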
