A Novel Approach for Formation of Dense Clusters by Outlier Elimination and Standard Deviation

2024 Fourth International Conference on Advances in Electrical, Computing, Communication and Sustainable Technologies (ICAECT) Pub Date : 2024-01-11 DOI:10.1109/ICAECT60202.2024.10468952

Pushkar Joglekar, Tejaswini Katale, Aishwarya Katale, Surabhi Deshpande, Aarya Nirgude, Aakash Chotrani

Citations: 0

Abstract

Outliers are uncommon data points that deviate from the majority of the data in a dataset. The presence of outliers can degrade a model's performance and lead to incorrect data analysis. Hence, identifying and eliminating outliers is a crucial pre-processing step. This paper proposes an outlier-removal method based on standard deviation, a statistical measure of the dispersion within a dataset. In the proposed algorithm, the first step is to calculate the standard deviation of every feature in the dataset. Next, the feature with the highest standard deviation is chosen. After this column is normalized, a standardized value is calculated for each data point relative to the standardized median. These values are then arranged in ascending order, and the closest 85% of values on the left and the closest 85% on the right of the standardized median are selected. For the remaining features, only the observations corresponding to the selected data points are retained. To check the efficacy of the algorithm, it is applied to five standard datasets: Iris Species, Pima Diabetes Dataset, College Dataset, Seattle Weather, and Water Quality Dataset. After eliminating outliers, the proposed algorithm aims to form dense clusters. Compared with K-means clustering, it gives a better Silhouette score on all five datasets. The highest score, 0.7, is for Iris Species, and the largest difference between the Silhouette scores of K-means and the proposed algorithm, 0.49, is for the Water Quality Dataset.
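The filtering steps described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the abstract does not give the exact normalization or selection rule, so the sketch assumes z-score standardization and reads "closest left 85% and right 85%" as keeping, on each side of the median, the 85% of points nearest to it. The function name `filter_outliers` and the parameter `keep_fraction` are hypothetical.

```python
import numpy as np

def filter_outliers(X, keep_fraction=0.85):
    """Keep the rows nearest the median of the highest-variance feature.

    Hypothetical helper: one possible reading of the abstract,
    not the authors' code. Returns (filtered rows, boolean row mask).
    """
    X = np.asarray(X, dtype=float)

    # Steps 1-2: standard deviation of every feature, pick the largest.
    stds = X.std(axis=0)
    j = int(np.argmax(stds))
    if stds[j] == 0:
        # All features are constant: there is nothing to filter.
        return X, np.ones(len(X), dtype=bool)

    # Step 3 (assumption): z-score standardization of the chosen column.
    col = X[:, j]
    z = (col - col.mean()) / stds[j]

    # Step 4: signed distance of each point from the standardized median.
    d = z - np.median(z)

    # Points sitting exactly on the median are always kept.
    mask = d == 0

    # Step 5 (assumption): on each side of the median, sort by distance
    # and keep the closest `keep_fraction` of the points on that side.
    for side in (d < 0, d > 0):
        idx = np.flatnonzero(side)
        order = idx[np.argsort(np.abs(d[idx]), kind="stable")]
        n_keep = int(np.ceil(keep_fraction * idx.size))
        mask[order[:n_keep]] = True

    # Step 6: the same row selection applies to all remaining features.
    return X[mask], mask
```

Calling `filter_outliers(X)` on an `(n_samples, n_features)` array returns the retained rows together with the mask, so the same selection can be applied to labels or other aligned arrays. The retained rows could then be clustered and scored, for example with K-means and a Silhouette score as in the paper's comparison.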
