Utilizing Ensemble, Data Sampling and Feature Selection Techniques for Improving Classification Performance on Tweet Sentiment Data

Joseph D. Prusa, T. Khoshgoftaar, Amri Napolitano
Published in: 2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA), 2015-12-01
DOI: 10.1109/ICMLA.2015.21 (https://doi.org/10.1109/ICMLA.2015.21)
Citations: 7

Abstract

Utilizing Ensemble, Data Sampling and Feature Selection Techniques for Improving Classification Performance on Tweet Sentiment Data
Sentiment analysis of tweets is a popular method of opinion mining on social media. Many machine learning techniques exist that can improve the performance of classifiers trained to determine the sentiment or emotional polarity of a tweet; however, they are designed with different objectives, and it is unclear which techniques are most beneficial. Additionally, these techniques may behave differently depending on data quality issues such as class imbalance, a common problem when using real-world data. In an effort to determine which techniques are more important, we tested 12 techniques: eight feature selection techniques, bagging, boosting, and data sampling with two post-sampling class ratios. Using five base learners, we compare these techniques against each other and against each base learner with no additional technique. We train and test each classifier on a balanced dataset and two imbalanced datasets with different class ratios. Additionally, we conduct statistical tests to determine whether the differences observed between techniques are significant. Our results show that bagging and seven of the eight feature selection techniques significantly improve performance (compared to using no technique) on all three datasets, while boosting and data sampling are less beneficial for imbalanced tweet sentiment data. To the best of our knowledge, this is the first study comparing these three types of techniques on tweet sentiment data and the first to show that feature selection and ensemble techniques perform better than data sampling on tweet sentiment data.
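To make "data sampling with two post-sampling class ratios" concrete, here is a minimal sketch of one common form of data sampling, random undersampling of the majority class to a target class ratio. The abstract does not say which sampling method or which ratios the authors used, so the choice of undersampling, the function name `undersample_majority`, the toy labels, and the ratios 0.35 and 0.50 are all assumptions made for illustration only.

```python
import random
from collections import Counter


def undersample_majority(labels, minority_fraction, seed=0):
    """Return sorted indices of a subset in which the minority class makes up
    roughly `minority_fraction` of the instances.

    All minority instances are kept; majority instances are dropped at random.
    Hypothetical helper for illustration, not the authors' implementation.
    """
    if not 0.0 < minority_fraction < 1.0:
        raise ValueError("minority_fraction must be in (0, 1)")
    counts = Counter(labels)
    if len(counts) != 2:
        raise ValueError("expected exactly two classes")
    # most_common() orders by descending count: majority first, minority second.
    (majority, n_major), (minority, n_minor) = counts.most_common()
    # Solve n_minor / (n_minor + n_keep) = minority_fraction for n_keep.
    n_keep = round(n_minor * (1.0 - minority_fraction) / minority_fraction)
    n_keep = min(n_keep, n_major)  # cannot keep more majority instances than exist
    rng = random.Random(seed)
    major_idx = [i for i, y in enumerate(labels) if y == majority]
    minor_idx = [i for i, y in enumerate(labels) if y == minority]
    kept = rng.sample(major_idx, n_keep)
    return sorted(minor_idx + kept)


# Toy imbalanced label vector (assumed data): 900 majority, 100 minority.
labels = ["neg"] * 900 + ["pos"] * 100
for fraction in (0.35, 0.50):  # hypothetical post-sampling minority fractions
    idx = undersample_majority(labels, fraction)
    print(fraction, Counter(labels[i] for i in idx))
```

In an experiment of the kind the abstract describes, the returned indices would select the training instances only, leaving the test data at its original class distribution; that detail is a general practice and is not confirmed by the abstract.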