SimFiller: Similarity-Based Missing Values Filling Algorithm

Fateh ur Rehman, M. Abbas, Sajjad Murtaza, Wasi Haider Butt, S. Rehman, Usman Qamar
Published in: 2018 Thirteenth International Conference on Digital Information Management (ICDIM), 2018-09-01
DOI: https://doi.org/10.1109/ICDIM.2018.8846983
Citations: 1

Abstract

As heterogeneous data sources multiply, the volume of low-quality data grows daily. This research proposes SimFiller, a similarity-based algorithm for filling missing (null) values, to improve data quality for the data mining process. The algorithm calculates the similarity of record pairs from the input data, considering only pairs in which at least one member has a non-null value for the attribute under consideration. After finding similar pairs, it fills each missing value using the pair with the greatest similarity, subject to the specified similarity threshold. The quality of the resulting data is evaluated through classification accuracy on the Audiology dataset. Five other missing-value filling algorithms were selected for comparison, giving six filled copies of the Audiology dataset in total, and all six copies were tested for classification accuracy. The results show a substantial improvement in classification accuracy for the copy filled with the proposed algorithm, indicating that the quality of the dataset is enhanced. The proposed algorithm can also be tested on other datasets to fill their missing (null) values, and it can be extended to remove other inconsistencies from datasets.
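
The filling step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say which similarity measure is used, so the attribute-overlap measure in `similarity`, the function name `sim_fill`, and the reading of the threshold as a minimum required similarity are all assumptions made for this example.

```python
from typing import Optional

Record = dict[str, Optional[str]]


def similarity(a: Record, b: Record, skip: str) -> float:
    """Fraction of attributes (other than `skip`) on which both records
    hold the same non-null value. ASSUMPTION: the source does not name
    its similarity measure; simple overlap is used for illustration."""
    attrs = [k for k in a if k != skip]
    if not attrs:
        return 0.0
    matches = sum(1 for k in attrs if a[k] is not None and a[k] == b.get(k))
    return matches / len(attrs)


def sim_fill(records: list[Record], threshold: float = 0.5) -> list[Record]:
    """Return a copy of `records` in which each null value is filled from
    the most similar record that has a non-null value for that attribute.
    ASSUMPTION: `threshold` is treated as a minimum similarity."""
    filled = [dict(r) for r in records]
    for i, rec in enumerate(records):
        for attr, value in rec.items():
            if value is not None:
                continue
            best_sim, best_val = -1.0, None
            # Candidate donors: other records with a non-null value here.
            for j, other in enumerate(records):
                if j == i or other.get(attr) is None:
                    continue
                s = similarity(rec, other, skip=attr)
                if s > best_sim:
                    best_sim, best_val = s, other[attr]
            # Fill only when the best donor is similar enough.
            if best_val is not None and best_sim >= threshold:
                filled[i][attr] = best_val
    return filled


# Hypothetical toy records, not taken from the Audiology dataset.
rows = [
    {"age": "old", "noise": "yes", "speech": "normal"},
    {"age": "old", "noise": "yes", "speech": None},
    {"age": "young", "noise": "no", "speech": "poor"},
]
result = sim_fill(rows, threshold=0.5)
print(result[1]["speech"])  # prints "normal"
```

In this toy input the second record matches the first on both remaining attributes, so the first record is chosen as the donor. A value whose best donor falls below the threshold is left null rather than filled.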