C4.5 Algorithm Based on the Sample Selection and Cosine Similarity

Suzhi Zhang, Xiao-Ni Chen
Published in: 2019 IEEE 5th International Conference on Computer and Communications (ICCC), pp. 490-495
Publication date: 2019-12-01
DOI: 10.1109/ICCC47050.2019.9064346 (https://doi.org/10.1109/ICCC47050.2019.9064346)
Citations: 0

Abstract

To improve classification accuracy, shorten training time on high-dimensional, large sample sets, and reduce redundant decision-tree rules, an improved C4.5 algorithm based on sample selection and cosine similarity is proposed. When processing a large sample set, the algorithm first uses a statistical optimum-sample-size algorithm to determine the optimal sample size for the data set. Samples of that size are then selected from the data set, the accuracy of the selected training samples is optimized as the iteration criterion, and an optimal training set is found through the iterative process. Next, potentially similar attributes are identified from the difference between the information entropies of any two attributes in the training sample; the cosine similarity of each potentially similar attribute pair is calculated, attributes whose similarity falls within the threshold range are merged, and the information gain ratio of each merged attribute is calculated. Finally, the best split attribute is chosen as in the traditional C4.5 algorithm, and the decision tree is built. Simulation results show that, compared with the traditional C4.5 algorithm, the proposed algorithm reduces redundant rules and improves execution efficiency and classification accuracy.
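The attribute-merging step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how attributes are turned into vectors, how merged attributes are represented, or which thresholds are used. The sketch assumes discrete attributes, integer-codes their values in order of first appearance, represents a merged attribute as the pair of the two original values, and uses hypothetical parameters `entropy_gap` and `sim_threshold`.

```python
import math
from collections import Counter
from itertools import combinations


def entropy(values):
    """Shannon entropy (base 2) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def gain_ratio(attr, labels):
    """C4.5 gain ratio: information gain divided by split information."""
    n = len(labels)
    groups = {}
    for a, y in zip(attr, labels):
        groups.setdefault(a, []).append(y)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    split_info = entropy(attr)
    if split_info == 0:
        return 0.0  # attribute has a single value, so it cannot split the data
    return (entropy(labels) - conditional) / split_info


def encode(values):
    """Assumption: map each distinct value to an integer code 1..k."""
    codes = {}
    return [codes.setdefault(v, len(codes) + 1) for v in values]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def merge_similar_attributes(columns, entropy_gap=0.1, sim_threshold=0.95):
    """Return (name_a, name_b, merged_column) for each pair judged similar.

    entropy_gap and sim_threshold are hypothetical tuning parameters.
    """
    merged = []
    for a, b in combinations(columns, 2):
        # Step 1: only pairs with close entropies are candidate similar pairs.
        if abs(entropy(columns[a]) - entropy(columns[b])) > entropy_gap:
            continue
        # Step 2: confirm the candidate pair with cosine similarity.
        sim = cosine_similarity(encode(columns[a]), encode(columns[b]))
        if sim >= sim_threshold:
            # Assumption: the merged attribute is the tuple of both values.
            merged.append((a, b, list(zip(columns[a], columns[b]))))
    return merged


# Toy data, invented purely for illustration.
columns = {
    "outlook": ["sunny", "sunny", "rain", "rain"],
    "sky": ["clear", "clear", "cloudy", "cloudy"],
    "windy": ["yes", "no", "yes", "no"],
}
labels = ["no", "no", "yes", "yes"]

merged = merge_similar_attributes(columns)
for name_a, name_b, column in merged:
    print(name_a, name_b, gain_ratio(column, labels))
```

On this toy data only the `outlook`/`sky` pair passes both filters, and the merged attribute then competes with the remaining attributes on gain ratio when the split attribute is chosen. Because the integer codes are all positive, the cosine similarity is always positive and tends to be high, so under this encoding the threshold would need careful tuning on real data.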