A Comparison of Oversampling Methods on Imbalanced Topic Classification of Korean News Articles

The Journal of Cognitive Science. Pub Date: 2017-12-31. DOI: 10.17791/JCS.2017.18.4.391

Yirey Suh, Jae-Myung Yu, Jonghoon Mo, Cheongtag Kim

Citations: 24

Abstract

Machine learning has progressed to match human performance in several fields, including text classification. However, when training data are imbalanced, classifiers do not perform well. Oversampling is one way to overcome the problem of imbalanced data, and many oversampling methods are available that can be conveniently implemented. While comparative studies of oversampling methods on non-text data have been conducted, studies comparing oversampling methods on text data under a unifying framework are scarce. This study finds that while oversampling methods generally improve classifier performance, similarity is an important factor that influences classifier performance on both imbalanced and resampled data.
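For readers unfamiliar with the idea, the following is a minimal sketch of random oversampling, the simplest member of this family of methods. It is an illustration only: the abstract does not state which oversampling methods the study compared, and the function name `random_oversample` and the toy corpus are hypothetical.

```python
import random
from collections import Counter


def random_oversample(texts, labels, seed=0):
    """Duplicate minority-class examples at random until every class
    has as many examples as the largest class.

    Illustrative sketch only; not the procedure used in the paper.
    """
    rng = random.Random(seed)

    # Group the documents by their class label.
    by_class = {}
    for text, label in zip(texts, labels):
        by_class.setdefault(label, []).append(text)

    # Every class is grown to the size of the largest class.
    target = max(len(items) for items in by_class.values())

    out_texts, out_labels = [], []
    for label, items in by_class.items():
        # Sample with replacement to fill the gap up to the target size.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for text in items + extra:
            out_texts.append(text)
            out_labels.append(label)
    return out_texts, out_labels


# Hypothetical toy corpus: four "politics" articles and one "sports" article.
texts = ["p1", "p2", "p3", "p4", "s1"]
labels = ["politics"] * 4 + ["sports"]

new_texts, new_labels = random_oversample(texts, labels)
print(Counter(new_labels))
```

After resampling, both classes contain four examples, with the single "sports" article repeated to fill the gap. In practice, resampling would be applied to the training split only, so that duplicated examples do not leak into the evaluation data.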
