The importance of performance metrics within wrapper feature selection

2013 IEEE 14th International Conference on Information Reuse & Integration (IRI). Pub Date: 2013-08-01. DOI: 10.1109/IRI.2013.6642460

Randall Wald, T. Khoshgoftaar, Amri Napolitano


Citations: 10

Abstract



The importance of performance metrics within wrapper feature selection

Many important datasets suffer from high dimensionality (a large number of attributes or features), which can result in complex and time-consuming classification models. Feature selection techniques try to identify an optimal subset of features, which may both improve classification performance and reveal the features most important for the application at hand. Wrapper feature selection in particular uses a classifier to discover which feature subsets are most useful. However, feature selection can be affected by another dataset problem: imbalanced data. When one class outnumbers the other class(es), the chosen features may not reflect those most important to all classes, especially when wrapper feature selection uses a performance metric which does not consider class imbalance. No previous work has examined how the choice of performance metric within wrapper-based feature selection affects classification performance. To study this effect, we consider two high-dimensional datasets drawn from the field of Twitter profile mining, both of which exhibit class imbalance. Using the Logistic Regression learner, we perform wrapper feature selection followed by classification, using five different performance metrics (Area Under the Receiver Operating Characteristic Curve, Area Under the Precision-Recall Curve, Best Arithmetic Mean of TPR and TNR, Best Geometric Mean of TPR and TNR, and Overall Accuracy) both within the wrapper and for evaluating the classification model. We find that performance metrics which take class imbalance into account, especially the Area Under the Precision-Recall Curve, are far more effective than Overall Accuracy when used within the wrapper, producing much better performance as evaluated by the metrics which consider imbalance. In fact, even when Overall Accuracy is the classification metric, it is not the best metric to use within the wrapper. In addition, we find that there is no direct connection between the metric used inside the wrapper and the metric used for classification evaluation: the wrapper metrics show similar patterns across all four balance-aware evaluation metrics (i.e., all but Overall Accuracy).
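To illustrate why a metric that ignores class imbalance can mislead, the following is a minimal sketch in plain Python, not the authors' code. It computes four of the five metrics named in the abstract on a made-up dataset of 5 positive and 95 negative instances. The toy scores, the 0.5 threshold for Overall Accuracy, and the reading of "Best" as the maximum over candidate thresholds are all assumptions made for illustration; the Area Under the Precision-Recall Curve is omitted for brevity.

```python
from math import sqrt, inf

def rates(y_true, scores, threshold):
    """Return (TPR, TNR, accuracy), predicting positive when score >= threshold."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    tn = sum(1 for y, s in zip(y_true, scores) if y == 0 and s < threshold)
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    return tp / n_pos, tn / n_neg, (tp + tn) / len(y_true)

def overall_accuracy(y_true, scores, threshold=0.5):
    # Assumption: a fixed 0.5 decision threshold.
    return rates(y_true, scores, threshold)[2]

def _candidate_thresholds(scores):
    # Every distinct score, plus +inf so "predict nothing positive" is covered.
    return set(scores) | {inf}

def best_geometric_mean(y_true, scores):
    # Assumption: "Best" means the maximum over candidate thresholds.
    return max(sqrt(tpr * tnr)
               for tpr, tnr, _ in (rates(y_true, scores, t)
                                   for t in _candidate_thresholds(scores)))

def best_arithmetic_mean(y_true, scores):
    return max((tpr + tnr) / 2
               for tpr, tnr, _ in (rates(y_true, scores, t)
                                   for t in _candidate_thresholds(scores)))

def auc_roc(y_true, scores):
    """Chance that a random positive outscores a random negative (ties count half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical imbalanced data: 5 positives followed by 95 negatives.
y_true = [1] * 5 + [0] * 95

# Model A gives every instance the same low score (always predicts the majority class).
majority_scores = [0.1] * 100

# Model B ranks most positives above most negatives, at the cost of 10 false alarms.
informative_scores = [0.6] * 4 + [0.3] + [0.1] * 85 + [0.7] * 10

for name, scores in [("majority", majority_scores), ("informative", informative_scores)]:
    print(name,
          "accuracy=%.3f" % overall_accuracy(y_true, scores),
          "auc=%.3f" % auc_roc(y_true, scores),
          "best_gm=%.3f" % best_geometric_mean(y_true, scores),
          "best_am=%.3f" % best_arithmetic_mean(y_true, scores))
```

On this toy data, Overall Accuracy prefers the model that never predicts the minority class (0.95 versus 0.89), while the three balance-aware metrics prefer the model that separates the classes. A wrapper that scores candidate feature subsets by Overall Accuracy could be steered the same way, which is the concern the abstract raises.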
