Handling imbalanced dataset in multi-label text categorization using Bagging and Adaptive Boosting

Genta Indra Winata, M. L. Khodra
{"title":"Handling imbalanced dataset in multi-label text categorization using Bagging and Adaptive Boosting","authors":"Genta Indra Winata, M. L. Khodra","doi":"10.1109/ICEEI.2015.7352552","DOIUrl":null,"url":null,"abstract":"Imbalanced dataset is occurred due to uneven distribution of data available in the real world such as disposition of complaints on government offices in Bandung. Consequently, multi-label text categorization algorithms may not produce the best performance because classifiers tend to be weighed down by the majority of the data and ignore the minority. In this paper, Bagging and Adaptive Boosting algorithms are employed to handle the issue and improve the performance of text categorization. The result is evaluated with four evaluation metrics such as hamming loss, subset accuracy, example-based accuracy and micro-averaged f-measure. Bagging.ML-LP with SMO weak classifier is the best performer in terms of subset accuracy and example-based accuracy. Bagging.ML-BR with SMO weak classifier has the best micro-averaged f-measure among all. In other hand, AdaBoost.MH with J48 weak classifier has the lowest hamming loss value. Thus, both algorithms have high potential in boosting the performance of text categorization, but only for certain weak classifiers. However, bagging has more potential than adaptive boosting in increasing the accuracy of minority labels.","PeriodicalId":426454,"journal":{"name":"2015 International Conference on Electrical Engineering and Informatics (ICEEI)","volume":"10 3","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2015-12-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"9","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2015 International Conference on Electrical Engineering and Informatics (ICEEI)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICEEI.2015.7352552","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 9

Abstract

Imbalanced datasets arise from the uneven distribution of data in the real world, such as the disposition of complaints submitted to government offices in Bandung. Consequently, multi-label text categorization algorithms may not achieve their best performance, because classifiers tend to be dominated by the majority of the data and to ignore the minority. In this paper, Bagging and Adaptive Boosting algorithms are employed to handle this issue and improve the performance of text categorization. The results are evaluated with four metrics: hamming loss, subset accuracy, example-based accuracy, and micro-averaged F-measure. Bagging.ML-LP with an SMO weak classifier is the best performer in terms of subset accuracy and example-based accuracy. Bagging.ML-BR with an SMO weak classifier has the best micro-averaged F-measure among all. On the other hand, AdaBoost.MH with a J48 weak classifier has the lowest hamming loss. Thus, both algorithms have high potential to boost the performance of text categorization, but only with certain weak classifiers. However, bagging has more potential than adaptive boosting for increasing the accuracy of minority labels.
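For concreteness, below is a minimal sketch (not the authors' implementation) of the bagged binary-relevance idea behind Bagging.ML-BR, evaluated with the four metrics named in the abstract. It uses scikit-learn on synthetic data; LinearSVC stands in for WEKA's SMO, and the ensemble size, data shape, and voting threshold are illustrative assumptions.

```python
# Sketch: bagging over binary-relevance multi-label classifiers, then scoring
# with hamming loss, subset accuracy, example-based accuracy, and micro-F1.
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC
from sklearn.metrics import hamming_loss, accuracy_score, jaccard_score, f1_score

# Synthetic multi-label data standing in for the complaint documents (assumption).
X, Y = make_multilabel_classification(n_samples=1000, n_features=50,
                                      n_classes=5, n_labels=2, random_state=0)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3,
                                                    random_state=0)

rng = np.random.default_rng(0)
n_estimators = 10                      # illustrative ensemble size
votes = np.zeros_like(Y_test, dtype=float)

for _ in range(n_estimators):
    # Bootstrap sample of the training documents.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # Binary relevance: one binary SVM per label (LinearSVC as a rough SMO stand-in).
    base = OneVsRestClassifier(LinearSVC())
    base.fit(X_train[idx], Y_train[idx])
    votes += base.predict(X_test)

# Majority vote across the bagged models gives the final label set per document.
Y_pred = (votes >= n_estimators / 2).astype(int)

print("Hamming loss:          ", hamming_loss(Y_test, Y_pred))
print("Subset accuracy:       ", accuracy_score(Y_test, Y_pred))   # exact-match ratio
print("Example-based accuracy:", jaccard_score(Y_test, Y_pred, average="samples"))
print("Micro-averaged F1:     ", f1_score(Y_test, Y_pred, average="micro"))
```

In this sketch, example-based accuracy is computed as the per-example Jaccard score (|Y ∩ Z| / |Y ∪ Z| averaged over documents), which matches its usual definition in the multi-label literature.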