Filter- and wrapper-based feature selection for predicting user interaction with Twitter bots

2013 IEEE 14th International Conference on Information Reuse & Integration (IRI). Pub Date: 2013-10-24. DOI: 10.1109/IRI.2013.6642501

Randall Wald, T. Khoshgoftaar, Amri Napolitano


Citations: 5

Abstract

High dimensionality (the presence of too many features) is a problem that plagues many datasets, including those mined from personality profiles. Feature selection can reduce the number of features, and many strategies have been proposed for selecting the most important features from a larger group. Feature rankers compute a metric for each feature and return the best features for a given subset size; filter-based subset evaluation performs statistical analysis on whole subsets; and wrapper-based subset selection builds classification models with candidate feature subsets to decide which features matter most for model-building. While all three approaches have been discussed in the literature, relatively little work compares them with one another directly. In the present study, we do precisely this, considering feature ranking, filter-based subset evaluation, and wrapper-based subset selection (along with no feature selection) on two datasets based on predicting interaction with bots on Twitter. For the two subset-based techniques, we consider two search techniques (Best First and Greedy Stepwise) to build the subsets, while we use one feature ranker (ROC) chosen for its excellent performance in previous work. Six learners are used to build models with the selected features. We find that feature ranking consistently performs well, giving the best results for four of the six learners on both datasets. In addition, all of the techniques other than feature ranking perform worse than no feature selection for four of the six learners. This leads us to recommend the use of feature ranking over more complex subset evaluation techniques.
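
To make the contrast between a feature ranker and a wrapper-based subset search concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a tiny made-up dataset (X, y), scores each feature by the area under the ROC curve when that feature alone is used as the predictor, and runs a greedy forward search in the spirit of Greedy Stepwise. The nearest-centroid learner and the use of training accuracy as the wrapper's merit are illustrative assumptions; the abstract does not name the six learners or the evaluation protocol. All function names here are hypothetical.

```python
from typing import Callable, List, Sequence

def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    pos = [s for s, t in zip(scores, labels) if t == 1]
    neg = [s for s, t in zip(scores, labels) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def rank_features_by_roc(X, y, k: int) -> List[int]:
    """Feature ranking: score every feature on its own, keep the top k."""
    scored = []
    for j in range(len(X[0])):
        a = auc([row[j] for row in X], y)
        scored.append((max(a, 1.0 - a), j))  # ignore direction of the feature
    scored.sort(key=lambda t: (-t[0], t[1]))  # best score first, ties by index
    return [j for _, j in scored[:k]]

def greedy_stepwise(n_features: int,
                    evaluate: Callable[[List[int]], float]) -> List[int]:
    """Forward search: add the single best feature until merit stops rising."""
    selected: List[int] = []
    best = float("-inf")
    while len(selected) < n_features:
        candidates = [j for j in range(n_features) if j not in selected]
        # Negated index breaks ties in favour of the lowest feature index.
        score, neg_j = max((evaluate(selected + [j]), -j) for j in candidates)
        if score <= best:
            break
        selected.append(-neg_j)
        best = score
    return selected

def centroid_accuracy(X, y, subset: List[int]) -> float:
    """Wrapper merit (assumption): training accuracy of nearest centroid."""
    def centroid(label: int) -> List[float]:
        rows = [[r[j] for j in subset] for r, t in zip(X, y) if t == label]
        return [sum(col) / len(rows) for col in zip(*rows)]
    c0, c1 = centroid(0), centroid(1)
    correct = 0
    for r, t in zip(X, y):
        v = [r[j] for j in subset]
        d0 = sum((a - b) ** 2 for a, b in zip(v, c0))
        d1 = sum((a - b) ** 2 for a, b in zip(v, c1))
        correct += int((1 if d1 < d0 else 0) == t)
    return correct / len(y)

# Made-up toy data: feature 0 separates the classes, features 1 and 2 do not.
X = [[0.1, 0.9, 0.5], [0.2, 0.1, 0.5], [0.3, 0.8, 0.4], [0.4, 0.3, 0.6],
     [0.6, 0.7, 0.5], [0.7, 0.2, 0.4], [0.8, 0.6, 0.6], [0.9, 0.4, 0.5]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

ranked = rank_features_by_roc(X, y, k=2)
wrapped = greedy_stepwise(3, lambda s: centroid_accuracy(X, y, s))
print("ROC ranking, top 2:", ranked)      # [0, 1]
print("Greedy wrapper subset:", wrapped)  # [0]
```

A filter-based subset evaluator would fit the same search loop by passing a statistical merit function (one that needs no learner) as `evaluate`. The ranker makes one pass over the features, whereas the wrapper retrains a model for every candidate subset, which is the cost difference behind the abstract's "more complex" characterization.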
