Long Document Classification in the Transformer Era: A Survey on Challenges, Advances, and Open Issues

WIREs Data Mining and Knowledge Discovery. Pub Date: 2025-05-09. DOI: 10.1002/widm.70019

Renzo Alva Principe, Nicola Chiarini, Marco Viviani

{"title":"Long Document Classification in the Transformer Era: A Survey on Challenges, Advances, and Open Issues","authors":"Renzo Alva Principe, Nicola Chiarini, Marco Viviani","doi":"10.1002/widm.70019","DOIUrl":null,"url":null,"abstract":"Automatic Document Classification (ADC) refers to the process of automatically categorizing or labeling documents into predefined classes or categories. Its effectiveness may depend on various factors, including the models used for the formal representation of documents, the classification techniques applied, or a combination of both. Recently, Transformer models have gained popularity due to their pre‐training on large corpora, allowing for flexible knowledge transfer to downstream tasks, such as ADC. However, such models can face challenges when handling “long” documents, particularly due to input sequence length constraints, which can have knock‐on effects on the task we refer to as Automatic Long Document Classification (ALDC). Distinct models for tackling this limitation of Transformers have been proposed over the past few years, and employed to perform ALDC; however, their application to this task has resulted in some inconsistent outcomes, struggles to surpass simple baselines, and difficulties in generalizing across diverse datasets and scenarios. 
That is why this survey aims to illustrate these limitations, by: (i) presenting current long document representation issues and solutions proposed in the literature; (ii) based on such solutions, illustrating a comprehensive analysis of their application in ALDC and their effectiveness; and (iii) discussing current evaluation strategies in ALDC with particular reference to suitable baselines and actual long‐document benchmark datasets.","PeriodicalId":501013,"journal":{"name":"WIREs Data Mining and Knowledge Discovery","volume":"27 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2025-05-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"WIREs Data Mining and Knowledge Discovery","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1002/widm.70019","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

Abstract

Automatic Document Classification (ADC) is the process of automatically assigning documents to predefined classes or categories. Its effectiveness may depend on various factors, including the models used to represent documents formally, the classification techniques applied, or a combination of both. Recently, Transformer models have gained popularity because their pre-training on large corpora allows flexible knowledge transfer to downstream tasks such as ADC. However, such models can face challenges when handling "long" documents, particularly because of input sequence length constraints, which in turn affect the task we refer to as Automatic Long Document Classification (ALDC). Several models that tackle this limitation of Transformers have been proposed over the past few years and applied to ALDC; however, their application to this task has produced some inconsistent outcomes, has struggled to surpass simple baselines, and has proved difficult to generalize across diverse datasets and scenarios. This survey therefore aims to illustrate these limitations by: (i) presenting current long document representation issues and the solutions proposed in the literature; (ii) providing, based on these solutions, a comprehensive analysis of their application in ALDC and their effectiveness; and (iii) discussing current evaluation strategies in ALDC, with particular reference to suitable baselines and genuinely long-document benchmark datasets.

