Data classification with k-NN using novel character frequency-direct word frequency (CF-DWF) similarity formula

M. A. Zardari, L. T. Jung
{"title":"Data classification with k-NN using novel character frequency-direct word frequency (CF-DWF) similarity formula","authors":"M. A. Zardari, L. T. Jung","doi":"10.1109/ISMSC.2015.7594066","DOIUrl":null,"url":null,"abstract":"The k-NN is one of the most popular and easy in implementation algorithm to classify the data. The best thing about k-NN is that it accepts changes with improved version. Despite many advantages of the k-NN, it is also facing many issues. These issues are: distance/similarity calculation complexity, training dataset complexity at classification phase, proper selection of k, and get duplicate values when training dataset is of single class. This paper focuses on only issue of distance/similarity calculation complexity. To avoid this complexity a new distance formula is proposed. The CF-DWF formula is only strings. The CF-DWF is no applicable for other data types. The F1-Score and precision of CF-DWF with k-NN are higher than traditional k-NN. The proposed similarity formula is also efficient than Euclidean Distance (E.D) and Cosine Similarity (C.S). The results section depicts that the k-NN with CF-DWF reduced computational complexity of k-NN with E.D and C.S from 4.77% to 43.69% and improved the F1-Score of traditional k-NN from 12% to 19%.","PeriodicalId":407600,"journal":{"name":"2015 International Symposium on Mathematical Sciences and Computing Research (iSMSC)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2015-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2015 International Symposium on Mathematical Sciences and Computing Research (iSMSC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ISMSC.2015.7594066","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 4

Abstract

The k-NN is one of the most popular and easiest-to-implement algorithms for data classification. A key strength of k-NN is that it readily accommodates improved variants. Despite its many advantages, k-NN also faces several issues: the complexity of distance/similarity calculation, the complexity of handling the training dataset at the classification phase, the proper selection of k, and duplicate values when the training dataset contains only a single class. This paper focuses solely on the issue of distance/similarity calculation complexity. To avoid this complexity, a new distance formula is proposed. The CF-DWF formula applies only to strings; it is not applicable to other data types. The F1-score and precision of k-NN with CF-DWF are higher than those of traditional k-NN. The proposed similarity formula is also more efficient than Euclidean Distance (E.D) and Cosine Similarity (C.S). The results section shows that k-NN with CF-DWF reduced the computational complexity of k-NN with E.D and C.S by 4.77% to 43.69% and improved the F1-score of traditional k-NN by 12% to 19%.
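To make the setting concrete, below is a minimal sketch of k-NN classification over strings using a similarity built from character frequencies and word frequencies, as the abstract describes. Since the paper's actual CF-DWF formula is not reproduced here, the `cfdwf_similarity` function is a hypothetical stand-in that simply averages character-level and word-level frequency overlap; the function names and the toy spam/ham data are illustrative assumptions, not the authors' method.

```python
# Sketch: k-NN over strings with a character-frequency + word-frequency similarity.
# NOTE: the exact CF-DWF formula is not given in the abstract; the scoring below
# is a placeholder combining the two frequency views, not the published formula.
from collections import Counter

def char_freq(text: str) -> Counter:
    """Character-frequency profile of a string (whitespace ignored)."""
    return Counter(c for c in text.lower() if not c.isspace())

def word_freq(text: str) -> Counter:
    """Direct word-frequency profile of a string."""
    return Counter(text.lower().split())

def overlap(a: Counter, b: Counter) -> float:
    """Normalized multiset overlap between two frequency profiles."""
    shared = sum((a & b).values())
    total = max(sum(a.values()), sum(b.values()), 1)
    return shared / total

def cfdwf_similarity(s1: str, s2: str) -> float:
    """Hypothetical CF-DWF-style score: average of character- and word-level overlap."""
    return 0.5 * overlap(char_freq(s1), char_freq(s2)) + \
           0.5 * overlap(word_freq(s1), word_freq(s2))

def knn_classify(query: str, training: list, k: int = 3) -> str:
    """Label `query` by majority vote among the k most similar training strings."""
    ranked = sorted(training, key=lambda tl: cfdwf_similarity(query, tl[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

if __name__ == "__main__":
    train = [
        ("cheap pills buy now", "spam"),
        ("limited offer buy cheap", "spam"),
        ("meeting agenda for monday", "ham"),
        ("project status and agenda", "ham"),
    ]
    print(knn_classify("buy cheap pills today", train, k=3))  # -> spam
```

The point of the sketch is the structure of the classifier: only the similarity function changes relative to k-NN with Euclidean Distance or Cosine Similarity, which is where the paper claims its reduction in computational complexity.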