Bench marking of classification algorithms: Decision Trees and Random Forests - a case study using R
Author: Manish Varma Datla
Venue: 2015 International Conference on Trends in Automation, Communications and Computing Technology (I-TACT-15)
Publication date: 2015-12-01
DOI: 10.1109/ITACT.2015.7492647 (https://doi.org/10.1109/ITACT.2015.7492647)
Citation count: 11
Abstract
Decision Trees and Random Forests are leading machine learning algorithms used for classification. This paper compares the classification results of the two algorithms on data sets from Kaggle's Bike Sharing System and Titanic problems. The solution methodology has two parts. The first is feature engineering, in which the given instance variables are cleaned of noise and two or more variables are combined to derive a valuable new one. The second is working out the classification parameters: correctly classified instances, incorrectly classified instances, precision, and accuracy. This process ensured that the instance variables and classification parameters were properly prepared before they were used with the two algorithms. The developed model was validated using the systems data and the classification results. From the model, the paper concludes that, across classification problems, Decision Trees are handy for small data sets (fewer instances), while Random Forests give better results for the same number of attributes on large data sets (more instances). The R language was used to solve the problem and to present the results.
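The two steps named in the abstract can be illustrated with a minimal sketch. It is written in Python for brevity, whereas the paper itself used R, and it is not the author's code. The derived feature `FamilySize` (built from the Titanic columns `SibSp` and `Parch`), the helper names `add_family_size` and `summarize`, and the label values are all assumptions chosen for illustration.

```python
from collections import namedtuple

# Hypothetical container for the four "classification parameters" listed in the abstract.
ClassificationSummary = namedtuple(
    "ClassificationSummary", ["correct", "incorrect", "accuracy", "precision"]
)


def add_family_size(row):
    """Feature engineering: combine two variables into a third.

    Assumed example: FamilySize = SibSp + Parch + 1 (the passenger themselves).
    Returns a copy so the input row is left unchanged.
    """
    out = dict(row)
    out["FamilySize"] = row["SibSp"] + row["Parch"] + 1
    return out


def summarize(actual, predicted, positive=1):
    """Count correct/incorrect predictions and derive accuracy and precision."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("need at least one instance")

    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    incorrect = len(actual) - correct

    # Precision = true positives / all instances predicted as positive.
    true_pos = sum(
        1 for a, p in zip(actual, predicted) if p == positive and a == positive
    )
    pred_pos = sum(1 for p in predicted if p == positive)
    precision = true_pos / pred_pos if pred_pos else float("nan")

    return ClassificationSummary(correct, incorrect, correct / len(actual), precision)


# Made-up values for illustration only (1 = survived, 0 = did not survive).
passenger = add_family_size({"SibSp": 1, "Parch": 2})
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]
summary = summarize(actual, predicted)
print(passenger["FamilySize"], summary)
```

The same `summarize` call could be applied to the predictions of a decision tree and of a random forest on the same held-out data, which is one way to put the two algorithms side by side on these four parameters.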