Utilizing Ensemble, Data Sampling and Feature Selection Techniques for Improving Classification Performance on Tweet Sentiment Data
Joseph D. Prusa, T. Khoshgoftaar, Amri Napolitano
2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA), 2015-12-01
DOI: 10.1109/ICMLA.2015.21 (https://doi.org/10.1109/ICMLA.2015.21)
Sentiment analysis of tweets is a popular method for mining opinions from social media. Many machine learning techniques can improve the performance of classifiers trained to determine the sentiment, or emotional polarity, of a tweet; however, they are designed with different objectives, and it is unclear which are most beneficial. Additionally, these techniques may behave differently depending on data quality issues such as class imbalance, a common problem with real-world data. To determine which techniques are most important, we tested 12 techniques: eight feature selection techniques, bagging, boosting, and data sampling with two post-sampling class ratios. Using five base learners, we compare these techniques against each other and against each base learner with no additional technique. We train and test each classifier on a balanced dataset and on two imbalanced datasets with different class ratios, and we conduct statistical tests to determine whether the differences observed between techniques are significant. Our results show that bagging and seven of the eight feature selection techniques significantly improve performance (compared to using no technique) on all three datasets, while boosting and data sampling are less beneficial for imbalanced tweet sentiment data. To the best of our knowledge, this is the first study to compare these three types of techniques on tweet sentiment data and the first to show that feature selection and ensemble techniques perform better than data sampling on such data.
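Two of the technique families named above, data sampling to a post-sampling class ratio and bagging, can be illustrated with a minimal sketch. The abstract does not state which sampling method or which class ratios were used, so random undersampling and the 50:50 and 35:65 ratios below are assumptions chosen only for illustration. The function names (`random_undersample`, `bootstrap_sample`, `majority_vote`) and the toy data are hypothetical and are not taken from the paper.

```python
import random
from collections import Counter


def random_undersample(samples, labels, minority_fraction, seed=0):
    """Randomly drop majority-class instances until the minority class makes
    up roughly `minority_fraction` of the data (assumed method, not the paper's)."""
    if not 0.0 < minority_fraction <= 0.5:
        raise ValueError("minority_fraction must be in (0, 0.5]")
    counts = Counter(labels)
    if len(counts) != 2:
        raise ValueError("expected exactly two classes")
    (majority, n_major), (minority, n_minor) = counts.most_common(2)
    # Majority instances to keep so that
    # n_minor / (n_minor + keep) is approximately minority_fraction.
    keep = min(n_major, round(n_minor * (1 - minority_fraction) / minority_fraction))
    rng = random.Random(seed)
    major_idx = [i for i, y in enumerate(labels) if y == majority]
    kept = set(rng.sample(major_idx, keep))
    idx = [i for i, y in enumerate(labels) if y == minority or i in kept]
    return [samples[i] for i in idx], [labels[i] for i in idx]


def bootstrap_sample(samples, labels, rng):
    """Draw len(samples) instances with replacement (one bagging round)."""
    idx = [rng.randrange(len(samples)) for _ in range(len(samples))]
    return [samples[i] for i in idx], [labels[i] for i in idx]


def majority_vote(predictions):
    """Combine the predictions of several models for one instance."""
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical imbalanced toy data: 20 negative and 80 positive tweets.
tweets = [f"tweet {i}" for i in range(100)]
labels = ["neg"] * 20 + ["pos"] * 80

# Two assumed post-sampling class ratios: 50:50 and 35:65.
balanced_x, balanced_y = random_undersample(tweets, labels, 0.5)
skewed_x, skewed_y = random_undersample(tweets, labels, 0.35)

# One bagging round: a bootstrap replicate on which a base learner would be trained.
boot_x, boot_y = bootstrap_sample(tweets, labels, random.Random(1))
```

In a full bagging ensemble, a base learner would be trained on each bootstrap replicate and `majority_vote` would combine their per-tweet predictions. The sketch leaves out the learners themselves, since the abstract does not name them.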