Learning with limited minority class data

Sixth International Conference on Machine Learning and Applications (ICMLA 2007). Pub Date: 2007-12-13. DOI: 10.1109/ICMLA.2007.76

T. Khoshgoftaar, Chris Seiffert, J. V. Hulse, Amri Napolitano, A. Folleco

Citations: 103

Abstract

A practical problem in data mining and machine learning is the limited availability of data. In a binary classification problem, for example, examples of one class are often abundant while examples of the other class are in short supply. Examples from one class, typically the positive class, can be limited by the financial cost or time required to collect them. This work presents a comprehensive empirical study of learning when examples from one class are extremely rare but examples of the other class(es) are plentiful. Specifically, we address the question of how many examples from the abundant class should be used when training a classifier on data where one class is very rare. Nearly one million classifiers were built and evaluated to generate the results presented in this work. Our results demonstrate that the often-used 'even distribution' is not optimal when dealing with such rare events.
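To make the question concrete, the sketch below shows one way to vary how many abundant-class examples are kept for each rare-class example before training. It is an illustration only, not the authors' experimental procedure: the function name `undersample_majority`, the parameter `majority_per_minority`, and the toy labels are all assumptions introduced here. A value of 1.0 corresponds to the 'even distribution' mentioned in the abstract; larger values keep more of the abundant class.

```python
import random


def undersample_majority(labels, minority_label, majority_per_minority, seed=0):
    """Return sorted indices that keep every minority example and a random
    subset of the majority examples (hypothetical helper, not from the paper).

    majority_per_minority=1.0 yields a 1:1 ('even') class distribution;
    larger values keep more majority examples per minority example.
    """
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority_label]
    majority_idx = [i for i, y in enumerate(labels) if y != minority_label]
    # Target number of majority examples, capped by how many are available.
    n_keep = min(len(majority_idx),
                 round(len(minority_idx) * majority_per_minority))
    kept_majority = rng.sample(majority_idx, n_keep)
    return sorted(minority_idx + kept_majority)


# Toy data (assumed): 10 rare positive labels and 990 abundant negative labels.
labels = [1] * 10 + [0] * 990
for ratio in (1.0, 2.0, 9.0):
    idx = undersample_majority(labels, minority_label=1,
                               majority_per_minority=ratio)
    n_pos = sum(labels[i] == 1 for i in idx)
    print(f"ratio {ratio}: {n_pos} minority, {len(idx) - n_pos} majority")
```

The returned indices can be used to select the training subset for any classifier. Comparing performance across several ratios on held-out data is the kind of comparison the abstract describes, though the specific ratios, learners, and metrics used in the study are not stated here.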



