Bag-of-Word approach is not dead: A performance analysis on a myriad of text classification challenges

Natural Language Processing Journal, Pub Date: 2025-05-06, DOI: 10.1016/j.nlp.2025.100154

Mario Graff, Daniela Moctezuma, Eric S. Téllez

Volume 11, Article 100154. https://www.sciencedirect.com/science/article/pii/S2949719125000305


Abstract

The Bag-of-Words (BoW) representation, paired with a classifier, was a pioneering approach to solving text classification problems. With the advent of transformers and deep learning architectures in general, however, the field has shifted its focus towards customizing these architectures for natural language processing tasks, text classification included. A newcomer may therefore not realize that the traditional approach is still competitive on some text classification problems. This research analyzes the competitiveness of BoW-based representations in text classification competitions run in English, Spanish, and Italian. To measure their performance, we participated in 12 international text classification competitions, totaling 24 tasks: five in English, seven in Italian, and twelve in Spanish. The results show that the proposed BoW representations differ from the competition winner by just 10%, and by less than 2% in three tasks corresponding to author profiling. BoW outperforms BERT solutions and dominates in author profiling tasks.
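
For readers unfamiliar with the idea, the sketch below illustrates what "a BoW representation paired with a classifier" can look like. It is not the authors' system: the whitespace tokenizer, the multinomial Naive Bayes classifier, the names `bag_of_words` and `NaiveBayesBoW`, and the toy sentences are all assumptions chosen to keep the example short.

```python
import math
from collections import Counter, defaultdict


def bag_of_words(text):
    """Map a text to token counts; word order is discarded."""
    return Counter(text.lower().split())


class NaiveBayesBoW:
    """Multinomial Naive Bayes over BoW counts with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_docs = Counter(labels)          # documents per class
        self.token_counts = defaultdict(Counter)   # token counts per class
        for text, label in zip(texts, labels):
            self.token_counts[label].update(bag_of_words(text))
        self.vocab = {t for c in self.token_counts.values() for t in c}
        return self

    def predict(self, text):
        bow = bag_of_words(text)
        n_docs = sum(self.class_docs.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_docs.items():
            counts = self.token_counts[label]
            total = sum(counts.values()) + len(self.vocab)
            score = math.log(n / n_docs)           # log prior
            for token, freq in bow.items():
                if token in self.vocab:            # skip tokens never seen in training
                    score += freq * math.log((counts[token] + 1) / total)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy data, invented for illustration only.
train_texts = [
    "great movie loved it",
    "wonderful acting great plot",
    "terrible movie hated it",
    "awful plot terrible acting",
]
train_labels = ["pos", "pos", "neg", "neg"]

model = NaiveBayesBoW().fit(train_texts, train_labels)
print(model.predict("loved the great acting"))
```

The representation keeps only token counts, so the classifier sees which words occur and how often, not their order. Competitive BoW systems typically use richer tokenization and weighting than this sketch.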

