Clustering Algorithms for Incomplete Datasets
Loai Abdallah, I. Shimshoni
Recent Applications in Data Clustering, published 2018-08-01
DOI: 10.5772/INTECHOPEN.78272 (https://doi.org/10.5772/INTECHOPEN.78272)
Citation count: 2
Abstract
Many real-world datasets suffer from missing values. Several methods have been developed to deal with this problem, and many of them fill each missing value with a fixed value obtained by statistical computation. In this research, we developed new versions of the k-means and mean shift clustering algorithms that handle datasets with missing values without filling those values in. We developed a new distance function that is able to compute distances over incomplete datasets. The distance is computed based only on the mean and variance of the data for each attribute. As a result, the runtime complexity of our computation is O(1). We experimented on six standard numerical datasets from different fields. On these datasets, we simulated missing values and compared the performance of the developed algorithms, using our distance and the suggested mean computations, to three other basic methods. Our experiments show that the developed algorithms using our distance function outperform the existing k-means and mean shift algorithms combined with other methods for dealing with missing values.
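To make the idea of a distance built only from per-attribute mean and variance concrete, here is a minimal Python sketch. The abstract does not give the formula, so the following is an assumption rather than the authors' definition: a missing coordinate is treated as a random draw from its attribute's distribution, and the unknown squared difference is replaced by its expected value. Under that assumption, a known value x against a missing one contributes (x - mean)^2 + var, and two missing values contribute 2 * var. The function names `attribute_stats` and `expected_sq_distance` are hypothetical.

```python
import numpy as np


def attribute_stats(data):
    """Per-attribute mean and variance over the observed (non-NaN) entries."""
    return np.nanmean(data, axis=0), np.nanvar(data, axis=0)


def expected_sq_distance(x, y, mean, var):
    """Expected squared Euclidean distance between two possibly incomplete points.

    Assumption (not taken from the abstract): a missing coordinate behaves like
    an independent draw from that attribute's distribution, which is summarised
    here only by its mean and variance.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_missing, y_missing = np.isnan(x), np.isnan(y)

    # Substitute the attribute mean for missing entries so no NaN propagates.
    x_filled = np.where(x_missing, mean, x)
    y_filled = np.where(y_missing, mean, y)

    # Both known:    (x - y)^2
    # One missing:   (known - mean)^2 + var
    # Both missing:  2 * var
    n_missing = x_missing.astype(float) + y_missing.astype(float)
    per_attribute = (x_filled - y_filled) ** 2 + var * n_missing
    return float(per_attribute.sum())


# Toy data with one missing value, encoded as NaN.
data = np.array([
    [1.0, 2.0],
    [3.0, np.nan],
    [5.0, 6.0],
])
mean, var = attribute_stats(data)
d01 = expected_sq_distance(data[0], data[1], mean, var)
d02 = expected_sq_distance(data[0], data[2], mean, var)
```

In this toy example the second attribute has observed values 2 and 6, so its mean is 4 and its variance is 4; `d01` is therefore (1 - 3)^2 + (2 - 4)^2 + 4 = 12, while `d02` is the ordinary squared distance 32. Once the means and variances are precomputed, each coordinate costs a constant amount of work, which is consistent with the O(1) claim if that claim is read per attribute.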