Impact of Feature Selection Towards Short Text Classification

J. Jayakody, Vgtn Vidanagama, Indika Perera, Hmlk Herath
DOI: 10.1109/SCSE59836.2023.10215041
Published in: 2023 International Research Conference on Smart Computing and Systems Engineering (SCSE)
Publication date: 2023-06-29
Citations: 0

Abstract

Feature selection techniques are used in text classification pipelines to reduce the number of redundant or irrelevant features. Feature selection algorithms help to decrease overfitting, reduce training time, and improve the accuracy of the built models. Similarly, frequency-based feature reduction techniques help to eliminate unwanted features. Most existing work on feature selection has been based on general text, and the behavior of feature selection has not been properly evaluated on short-text datasets. This research was therefore conducted to investigate how performance varies with the features chosen by feature selection algorithms on short-text datasets. Three publicly available datasets were selected for the experiment. Chi-square, information gain, and F-measure were examined, as these algorithms have been identified as the best for selecting features for text classification. We also examined the impact of these algorithms when selecting different types of features, such as 1-grams and 2-grams. Finally, we looked at the impact of frequency-based feature reduction techniques on the selected datasets. Our results show that the information gain algorithm outperforms the other two. Moreover, selecting the best 20% of features with the information gain algorithm provides the same level of performance as the entire feature set. We further observed that the higher number of dimensions was due to bigrams, and noted the impact of n-grams on the feature selection algorithms. It is also worth noting that removing the features which occur twice in a document would be ideal before applying feature selection techniques with different algorithms.
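The pipeline described above — extracting 1-gram and 2-gram features, then keeping only a top percentage of them by a scoring criterion — can be sketched roughly as follows. This is a minimal illustration using scikit-learn, not the paper's actual setup: the toy documents, labels, and the 20% threshold are illustrative assumptions, and information gain is approximated here with mutual information, a closely related measure.

```python
# Hedged sketch of chi-square vs. information-gain feature selection over
# 1-/2-gram count features, keeping the top 20% of features.
# The documents and labels below are toy data, not the paper's datasets.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectPercentile, chi2, mutual_info_classif

docs = [
    "cheap loans apply now", "meeting moved to friday",
    "win a free prize today", "project review on monday",
]
labels = [1, 0, 1, 0]  # toy binary classes

# 1-gram and 2-gram features; min_df could emulate a simple
# frequency-based feature reduction step before selection.
vectorizer = CountVectorizer(ngram_range=(1, 2), min_df=1)
X = vectorizer.fit_transform(docs)

# Chi-square selection of the best 20% of features.
chi_selector = SelectPercentile(chi2, percentile=20).fit(X, labels)
X_chi = chi_selector.transform(X)

# Information gain approximated via mutual information on discrete counts.
ig_selector = SelectPercentile(
    lambda X, y: mutual_info_classif(X, y, discrete_features=True),
    percentile=20,
).fit(X, labels)
X_ig = ig_selector.transform(X)

print("features:", X.shape[1], "-> chi2:", X_chi.shape[1], "| mi:", X_ig.shape[1])
```

On real short-text corpora, the reduced matrices (`X_chi`, `X_ig`) would then feed a classifier, and the percentile would be tuned against the full feature set as the baseline, as the paper does with its 20% information-gain subset.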