Data Imputation Methods for Missing Values in the Context of Clustering

M. Aktaş, Sinan Kaplan, H. Abaci, O. Kalipsiz, U. Ketenci, Umut Orçun Turgut
Published in: Big Data and Knowledge Sharing in Virtual Organizations
DOI: 10.4018/978-1-5225-7519-1.CH011 (https://doi.org/10.4018/978-1-5225-7519-1.CH011)
Citations: 6

Abstract

Missing data is a common problem that degrades clustering quality. Most real-life datasets contain missing data, which in turn affects clustering tasks. This chapter investigates appropriate data treatment methods for varying missing data scarcity distributions, including gamma, Gaussian, and beta distributions. The data imputation methods analyzed are mean, hot-deck, regression, k-nearest neighbor, expectation maximization, and multiple imputation. To reveal the proper methods for dealing with missing data, data mining tasks such as clustering are used for evaluation. Through experimental studies, this chapter identifies the correlation between missing data imputation methods and missing data distributions for clustering tasks. The experimental results indicate that the expectation maximization and k-nearest neighbor methods provide the best results for varying missing data scarcity distributions.
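
Two of the imputation methods named in the abstract, mean and k-nearest neighbor, can be illustrated with a minimal sketch. This is not the chapter's implementation: the function names `mean_impute` and `knn_impute`, the toy table, and the use of `None` to mark a missing entry are all assumptions made for illustration, and the code assumes every column has at least one observed value.

```python
import math


def mean_impute(rows):
    """Replace each None with the mean of the observed values in its column.

    Assumes every column has at least one observed value.
    """
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if v is None else v for j, v in enumerate(r)] for r in rows]


def knn_impute(rows, k=2):
    """Replace each None with the average of that column over the k nearest rows.

    Distance is computed only over columns observed in both rows, and rows
    that are themselves missing the target column are not used as neighbors.
    """
    def distance(a, b):
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf  # no common observed column, treat as infinitely far
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    result = [list(r) for r in rows]  # copy so the input is left untouched
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = sorted(
                (distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            )
            neighbours = [val for _, val in candidates[:k]]
            result[i][j] = sum(neighbours) / len(neighbours)
    return result


# Hypothetical toy table with two visible groups and one missing entry.
data = [
    [1.0, 2.0],
    [1.2, 2.2],
    [9.0, 10.0],
    [9.1, None],
]

print(mean_impute(data)[3])
print(knn_impute(data, k=1)[3])
```

In this toy table, mean imputation fills the gap with the column average (about 4.73), which lies between the two groups, while the nearest-neighbor variant with `k=1` copies the value 10.0 from the closest row. This only shows how the choice of imputation method can move a point relative to cluster structure. It does not reproduce the chapter's experiments.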