C4.5 Algorithm Based on the Sample Selection and Cosine Similarity
Suzhi Zhang, Xiao-Ni Chen
2019 IEEE 5th International Conference on Computer and Communications (ICCC), pp. 490-495, published 2019-12-01
DOI: 10.1109/ICCC47050.2019.9064346 (https://doi.org/10.1109/ICCC47050.2019.9064346)
To improve classification accuracy, shorten training time on high-dimensional, large sample sets, and reduce redundant rules in the decision tree, an improved C4.5 algorithm based on sample selection and cosine similarity is proposed. For a large sample set, the algorithm first uses a statistical optimum-sample-size algorithm to determine the optimum sample size for the data set. Samples of that size are then selected from the data set, the accuracy obtained with the selected training samples is used as the iterative information, and an optimal training set is found through the iterative process. Next, potentially similar attributes are identified from the difference between the information entropies of any two attributes in the training sample; the cosine similarity of each potentially similar attribute pair is calculated, attributes whose similarity falls within the threshold range are merged, and the information gain ratio of the merged attribute is calculated. Finally, the best split attribute is chosen as in the traditional C4.5 algorithm and the decision tree is built. Simulation results show that, compared with the traditional C4.5 algorithm, the proposed algorithm reduces redundant rules and improves execution efficiency and classification accuracy.
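The attribute-merging step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how attributes are encoded as vectors, what the thresholds are, or how two attributes are merged. The function names, the `entropy_gap` and `sim_threshold` parameters, the numeric coding of attribute values, and the merge-by-pairing shown in the usage lines are all assumptions made for illustration. The entropy and gain ratio follow the standard C4.5 definitions.

```python
import math
from collections import Counter
from itertools import combinations


def entropy(values):
    """Shannon entropy (in bits) of a sequence of categorical values."""
    n = len(values)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(values).values())


def gain_ratio(attr, labels):
    """Standard C4.5 gain ratio of a categorical attribute w.r.t. the labels."""
    n = len(labels)
    groups = {}
    for a, y in zip(attr, labels):
        groups.setdefault(a, []).append(y)
    # Expected label entropy after splitting on the attribute.
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    split_info = entropy(attr)
    if split_info == 0:
        return 0.0  # attribute has a single value, so no useful split
    return (entropy(labels) - conditional) / split_info


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similar_pairs(columns, entropy_gap=0.1, sim_threshold=0.9):
    """Return attribute pairs that pass both filters.

    columns: dict mapping attribute name -> list of numeric codes (assumed
    encoding). entropy_gap and sim_threshold are hypothetical parameters.
    """
    h = {name: entropy(col) for name, col in columns.items()}
    pairs = []
    for a, b in combinations(columns, 2):
        # Filter 1: entropies close enough to be "potentially similar".
        if abs(h[a] - h[b]) > entropy_gap:
            continue
        # Filter 2: cosine similarity within the accepted range.
        if cosine_similarity(columns[a], columns[b]) >= sim_threshold:
            pairs.append((a, b))
    return pairs


# Toy data: "a" and "b" are identical, "c" has a different entropy.
columns = {"a": [1, 2, 1, 2], "b": [1, 2, 1, 2], "c": [1, 1, 1, 5]}
labels = ["y", "n", "y", "n"]

pairs = similar_pairs(columns)
# Assumed merge rule: pair up the two attributes' values row by row.
merged = list(zip(columns["a"], columns["b"]))
merged_ratio = gain_ratio(merged, labels)
```

On this toy data only the pair `("a", "b")` passes both filters, and the merged attribute is then scored with the same gain ratio that C4.5 uses to pick a split. A real data set would need a considered encoding for categorical values before cosine similarity is meaningful.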