{"title":"数据集大小对Tweet情感分类器训练的影响","authors":"Joseph D. Prusa, T. Khoshgoftaar, Naeem Seliya","doi":"10.1109/ICMLA.2015.22","DOIUrl":null,"url":null,"abstract":"Using automated methods of labeling tweet sentiment, large volumes of tweets can be labeled and used to train classifiers. Millions of tweets could be used to train a classifier, however, doing so is computationally expensive. Thus, it is valuable to establish how many tweets should be utilized to train a classifier, since using additional instances with no gain in performance is a waste of resources. In this study, we seek to find out how many tweets are needed before no significant improvements are observed for sentiment analysis when adding additional instances. We train and evaluate classifiers using C4.5 decision tree, Naïve Bayes, 5 Nearest Neighbor and Radial Basis Function Network, with seven datasets varying from 1000 to 243,000 instances. Models are trained using four runs of 5-fold cross validation. Additionally, we conduct statistical tests to verify our observations and examine the impact of limiting features using frequency. All learners were found to improve with dataset size, with Naïve Bayes being the best performing learner. We found that Naïve Bayes did not significantly benefit from using more than 81,000 instances. To the best of our knowledge, this is the first study to investigate how learners scale in respect to dataset size with results verified using statistical tests and multiple models trained for each learner and dataset size. Additionally, we investigated using feature frequency to greatly reduce data grid size with either a small increase or decrease in classifier performance depending on choice of learner.","PeriodicalId":288427,"journal":{"name":"2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA)","volume":"94 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2015-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"42","resultStr":"{\"title\":\"The Effect of Dataset Size on Training Tweet Sentiment Classifiers\",\"authors\":\"Joseph D. Prusa, T. Khoshgoftaar, Naeem Seliya\",\"doi\":\"10.1109/ICMLA.2015.22\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Using automated methods of labeling tweet sentiment, large volumes of tweets can be labeled and used to train classifiers. Millions of tweets could be used to train a classifier, however, doing so is computationally expensive. Thus, it is valuable to establish how many tweets should be utilized to train a classifier, since using additional instances with no gain in performance is a waste of resources. In this study, we seek to find out how many tweets are needed before no significant improvements are observed for sentiment analysis when adding additional instances. We train and evaluate classifiers using C4.5 decision tree, Naïve Bayes, 5 Nearest Neighbor and Radial Basis Function Network, with seven datasets varying from 1000 to 243,000 instances. Models are trained using four runs of 5-fold cross validation. Additionally, we conduct statistical tests to verify our observations and examine the impact of limiting features using frequency. All learners were found to improve with dataset size, with Naïve Bayes being the best performing learner. We found that Naïve Bayes did not significantly benefit from using more than 81,000 instances. 
To the best of our knowledge, this is the first study to investigate how learners scale in respect to dataset size with results verified using statistical tests and multiple models trained for each learner and dataset size. Additionally, we investigated using feature frequency to greatly reduce data grid size with either a small increase or decrease in classifier performance depending on choice of learner.\",\"PeriodicalId\":288427,\"journal\":{\"name\":\"2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA)\",\"volume\":\"94 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2015-12-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"42\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICMLA.2015.22\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICMLA.2015.22","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
The Effect of Dataset Size on Training Tweet Sentiment Classifiers
Using automated methods of labeling tweet sentiment, large volumes of tweets can be labeled and used to train classifiers. Millions of tweets could be used to train a classifier; however, doing so is computationally expensive. It is therefore valuable to establish how many tweets are needed to train a classifier, since using additional instances that yield no gain in performance is a waste of resources. In this study, we seek to determine the point at which adding instances no longer yields a significant improvement in sentiment classification performance. We train and evaluate classifiers using C4.5 decision trees, Naïve Bayes, 5-Nearest Neighbor, and Radial Basis Function Network on seven datasets ranging from 1,000 to 243,000 instances. Models are trained using four runs of 5-fold cross-validation. Additionally, we conduct statistical tests to verify our observations and examine the impact of limiting features by frequency. All learners improved with dataset size, with Naïve Bayes being the best-performing learner; however, Naïve Bayes did not significantly benefit from using more than 81,000 instances. To the best of our knowledge, this is the first study to investigate how learners scale with respect to dataset size, with results verified using statistical tests and multiple models trained for each combination of learner and dataset size. Additionally, we found that limiting features by frequency greatly reduces data size, with either a small increase or a small decrease in classifier performance depending on the choice of learner.
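The sketch below illustrates the kind of experimental setup the abstract describes, not the authors' actual code: training Naïve Bayes on increasingly large subsets of labeled tweets, evaluating each with four runs of 5-fold cross-validation, and limiting features by document frequency. The file name, column names, subset sizes, and the min_df cutoff are assumptions for illustration; scikit-learn stands in for whatever toolkit the authors used.

```python
# A minimal sketch of the study's evaluation design, assuming a hypothetical
# CSV of automatically labeled tweets with "text" and "sentiment" columns.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

df = pd.read_csv("labeled_tweets.csv")  # hypothetical input file

# Frequency-based feature limiting: discard tokens appearing in fewer than
# min_df tweets, shrinking the feature space before training.
model = make_pipeline(CountVectorizer(min_df=5), MultinomialNB())

# Dataset sizes loosely matching the paper's range of 1,000 to 243,000 instances.
for size in [1000, 3000, 9000, 27000, 81000, 243000]:
    subset = df.sample(n=min(size, len(df)), random_state=0)

    # Four runs of 5-fold cross-validation, as in the paper's design.
    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=4, random_state=0)
    scores = cross_val_score(model, subset["text"], subset["sentiment"], cv=cv)
    print(f"{size:>7} instances: mean accuracy {scores.mean():.4f}")
```

Comparing the mean scores across sizes (e.g., with a significance test on the per-fold results) is what lets one conclude, as the paper does for Naïve Bayes, that performance plateaus beyond a certain number of instances.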