C4.5 Algorithm Based on the Sample Selection and Cosine Similarity

2019 IEEE 5th International Conference on Computer and Communications (ICCC). Pub Date: 2019-12-01. DOI: 10.1109/ICCC47050.2019.9064346

Suzhi Zhang, Xiao-Ni Chen

Citations: 0

Abstract

To improve classification accuracy, shorten training time on high-dimensional, large sample sets, and reduce redundant rules in the decision tree, an improved C4.5 algorithm based on sample selection and cosine similarity is proposed. When processing a large sample set, the algorithm first uses a statistical optimum-sample-size algorithm to determine the optimum sample size for the data set. Samples of that size are then selected from the data set, the accuracy of the selected training samples is optimized iteratively, and an optimal training set is found through the iterative process. Next, potentially similar attributes are identified from the difference between the information entropies of any two attributes in the training sample; the cosine similarity of each potentially similar attribute pair is calculated, attributes whose similarity falls within the threshold range are merged, and the information gain ratio of the merged attribute is calculated. Finally, the best split attribute is chosen as in the traditional C4.5 algorithm, and the decision tree is built. Simulation results show that, compared with the traditional C4.5 algorithm, the proposed algorithm reduces redundant rules and improves execution efficiency and classification accuracy.
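The attribute-merging step can be illustrated with a minimal Python sketch. The entropy, gain ratio, and cosine similarity formulas are the standard ones; everything else is an assumption made for illustration, since the abstract does not give these details: the helper names, the `max_entropy_gap` parameter, the numeric coding of attribute values, and the choice to represent a merged attribute as a tuple of the two original values. It should not be read as the authors' implementation.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def gain_ratio(attribute, labels):
    """Standard C4.5 gain ratio of a discrete attribute w.r.t. the class labels."""
    n = len(labels)
    # Group the class labels by attribute value.
    groups = {}
    for a, y in zip(attribute, labels):
        groups.setdefault(a, []).append(y)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - conditional
    split_info = entropy(attribute)
    return info_gain / split_info if split_info > 0 else 0.0


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def candidate_pairs(columns, max_entropy_gap):
    """Hypothetical pre-filter: attribute pairs whose entropies differ
    by at most max_entropy_gap (an assumed parameter)."""
    names = list(columns)
    h = {name: entropy(columns[name]) for name in names}
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if abs(h[a] - h[b]) <= max_entropy_gap
    ]


# Toy data: numerically coded attribute columns (assumed encoding).
columns = {
    "a": [1, 2, 1, 2],
    "b": [2, 4, 2, 4],
    "c": [1, 1, 1, 2],
}
labels = ["yes", "no", "yes", "no"]

pairs = candidate_pairs(columns, max_entropy_gap=0.1)
for left, right in pairs:
    similarity = cosine_similarity(columns[left], columns[right])
    # Assumed merge rule: combine the two columns into one tuple-valued attribute.
    merged = list(zip(columns[left], columns[right]))
    print(left, right, similarity, gain_ratio(merged, labels))
```

In this toy data only the pair `("a", "b")` passes the entropy pre-filter, so the more expensive similarity computation runs once instead of for all three pairs. A real threshold range for the similarity would have to come from the paper itself.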
