Long Document Classification in the Transformer Era: A Survey on Challenges, Advances, and Open Issues
Renzo Alva Principe, Nicola Chiarini, Marco Viviani
WIREs Data Mining and Knowledge Discovery. Published: 2025-05-09
DOI: 10.1002/widm.70019 (https://doi.org/10.1002/widm.70019)
Abstract
Automatic Document Classification (ADC) is the task of automatically assigning documents to predefined classes or categories. Its effectiveness may depend on various factors, including the models used to formally represent documents, the classification techniques applied, or a combination of both. Recently, Transformer models have gained popularity because their pre-training on large corpora allows flexible knowledge transfer to downstream tasks such as ADC. However, these models can struggle with “long” documents, particularly because of input sequence length constraints, and this in turn affects the task we refer to as Automatic Long Document Classification (ALDC). Several models that address this limitation of Transformers have been proposed over the past few years and applied to ALDC; however, they have produced some inconsistent results, have struggled to surpass simple baselines, and have had difficulty generalizing across diverse datasets and scenarios. This survey therefore aims to illustrate these limitations by: (i) presenting current long document representation issues and the solutions proposed in the literature; (ii) providing a comprehensive analysis of how these solutions have been applied to ALDC and how effective they are; and (iii) discussing current evaluation strategies in ALDC, with particular reference to suitable baselines and genuinely long-document benchmark datasets.