A Comparative Evaluation of the Outlier Detection Methods
Authors: Melis Çelik Güney, Tamer Kayaalp
Journal: Black Sea Journal of Engineering and Science, volume "1 3"
Published: 2024-01-08
DOI: https://doi.org/10.34248/bsengineering.1387431
In data mining, outliers should be identified and excluded from the data set before analysis begins so that descriptive statistics and other statistical model parameters are calculated correctly. This paper studied and compared the performance of model-based, density-based, clustering-based, angle-based, and isolation-based outlier detection methods used in data mining. ROC curves and AUC values were used to compare the performance of the methods. A data set with a standard normal distribution, suitable for fitting a logistic regression, was simulated. To compare the methods, the data set was modified by adding 30 outliers. The iForest algorithm was found to have higher predictive power than the other methods. In addition, the iForest algorithm was used to find outliers in a real data set, and the data set was compared with and without those outliers.
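The kind of comparison described in the abstract can be sketched in a few lines of Python. This is a simplified illustration, not the authors' implementation: the 500 two-dimensional inliers, the placement of the 30 outliers, the tree count, the subsample size, and the distance-from-mean baseline (a stand-in for a model-based detector) are all assumptions made for the example, and the isolation forest is a minimal version written with the standard library only.

```python
import math
import random
import statistics

EULER_GAMMA = 0.5772156649


def c_factor(n):
    """Average path length of an unsuccessful binary-search-tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def build_tree(points, depth, max_depth, rng):
    """Recursively split on a random feature at a random threshold."""
    if depth >= max_depth or len(points) <= 1:
        return ("leaf", len(points))
    dim = rng.randrange(len(points[0]))
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:
        return ("leaf", len(points))
    split = rng.uniform(lo, hi)
    left = [p for p in points if p[dim] < split]
    right = [p for p in points if p[dim] >= split]
    return ("node", dim, split,
            build_tree(left, depth + 1, max_depth, rng),
            build_tree(right, depth + 1, max_depth, rng))


def path_length(tree, x):
    """Depth at which x lands in a leaf, plus a correction for unsplit points."""
    depth = 0
    while tree[0] == "node":
        _, dim, split, left, right = tree
        tree = left if x[dim] < split else right
        depth += 1
    return depth + c_factor(tree[1])


def iforest_scores(data, n_trees=100, sample_size=256, seed=0):
    """Anomaly score in (0, 1]; shorter average paths give higher scores."""
    rng = random.Random(seed)
    sample_size = min(sample_size, len(data))
    max_depth = math.ceil(math.log2(sample_size))
    trees = [build_tree(rng.sample(data, sample_size), 0, max_depth, rng)
             for _ in range(n_trees)]
    norm = c_factor(sample_size)
    return [2 ** (-sum(path_length(t, x) for t in trees) / (n_trees * norm))
            for x in data]


def distance_scores(data):
    """Baseline (assumed for illustration): standardized distance from the mean."""
    dims = list(zip(*data))
    means = [statistics.fmean(d) for d in dims]
    stds = [statistics.pstdev(d) for d in dims]
    return [math.sqrt(sum(((v - m) / s) ** 2 for v, m, s in zip(x, means, stds)))
            for x in data]


def roc_auc(labels, scores):
    """AUC as the chance that a random outlier outscores a random inlier."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Simulated data: standard normal inliers plus 30 injected outliers.
rng = random.Random(42)
inliers = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(500)]
outliers = [(rng.choice((-1, 1)) * rng.uniform(4, 6),
             rng.choice((-1, 1)) * rng.uniform(4, 6)) for _ in range(30)]
data = inliers + outliers
labels = [0] * len(inliers) + [1] * len(outliers)

iforest_auc = roc_auc(labels, iforest_scores(data))
baseline_auc = roc_auc(labels, distance_scores(data))
print(f"iForest AUC: {iforest_auc:.3f}")
print(f"Distance baseline AUC: {baseline_auc:.3f}")
```

Because the outliers here are placed far from the bulk of the data, both detectors should separate them easily, so this toy setup says nothing about which method is better in general. A comparison like the one in the paper would need harder outliers and a maintained implementation of each method.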