IRIS: Interpretable Retrieval-Augmented Classification for Long Interspersed Document Sequences.

Proceedings of the conference. Association for Computational Linguistics. Meeting. Pub Date: 2025-07-01

Fengnan Li, Elliot D Hill, Shu Jiang, Jiaxin Gao, Matthew M Engelhard

Volume 2025, pages 30263-30283. Publication type: Journal Article. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12357761/pdf/


Abstract



Transformer-based models achieve state-of-the-art performance in document classification but struggle with long texts because self-attention has quadratic computational complexity in sequence length. Existing solutions, such as sparse attention, hierarchical models, and key sentence extraction, partially address the issue but still fall short when the input sequence is exceptionally long. To address this challenge, we propose IRIS (Interpretable Retrieval-Augmented Classification for long Interspersed Document Sequences), a novel, lightweight framework that uses retrieval to classify long documents efficiently while enhancing interpretability. IRIS segments documents into chunks, stores their embeddings in a vector database, and retrieves those most relevant to a given task using learnable query vectors. A linear attention mechanism then aggregates the retrieved embeddings for classification, allowing the model to process arbitrarily long documents without increasing computational cost while remaining trainable on a single GPU. Our experiments across six datasets show that IRIS performs comparably to baseline models on standard benchmarks and excels in three clinical-note disease risk prediction tasks where documents are extremely long and key information is sparse. Furthermore, IRIS provides global interpretability by revealing a clear summary of the key risk factors identified by the model. These findings highlight the potential of IRIS as an efficient and interpretable solution for long-document classification, particularly in healthcare applications where both performance and explainability are crucial.
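The pipeline described in the abstract (chunk, embed, store, retrieve with query vectors, aggregate, classify) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no code, so every name here (`chunk_text`, `embed`, `retrieve`, `attention_pool`, `predict`) is hypothetical. The hashed bag-of-words `embed` stands in for a real text encoder, a plain Python list stands in for the vector database, the query vectors and classifier weights are fixed inputs rather than learned parameters, and single-query softmax pooling stands in for the paper's linear attention, whose exact form the abstract does not specify.

```python
import math
import zlib

Vector = list[float]

def chunk_text(text: str, chunk_size: int) -> list[str]:
    """Split a document into consecutive chunks of at most chunk_size words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

def embed(chunk: str, dim: int = 8) -> Vector:
    """Toy encoder (assumption): hashed bag-of-words, L2-normalised."""
    vec = [0.0] * dim
    for word in chunk.lower().split():
        vec[zlib.crc32(word.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))

def retrieve(query: Vector, store: list[Vector], k: int) -> list[int]:
    """Indices of the k stored embeddings with the highest dot product.
    Brute-force scan here; a vector database would use an index instead."""
    ranked = sorted(range(len(store)), key=lambda i: dot(query, store[i]), reverse=True)
    return ranked[:k]

def attention_pool(query: Vector, retrieved: list[Vector]) -> Vector:
    """Softmax-weighted average of the retrieved embeddings for one query.
    Cost grows with the number of retrieved chunks, not the document length."""
    scores = [dot(query, e) for e in retrieved]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(retrieved[0])
    return [sum(w * e[j] for w, e in zip(weights, retrieved)) / total for j in range(dim)]

def predict(text: str, queries: list[Vector], head_weights: Vector,
            head_bias: float, chunk_size: int = 4, k: int = 2) -> float:
    """Score a non-empty document: one pooled vector per query, concatenated,
    then passed through a logistic classification head."""
    store = [embed(c, len(queries[0])) for c in chunk_text(text, chunk_size)]
    features: Vector = []
    for q in queries:
        hits = [store[i] for i in retrieve(q, store, k)]
        features.extend(attention_pool(q, hits))
    logit = dot(features, head_weights) + head_bias
    return 1.0 / (1.0 + math.exp(-logit))
```

In this sketch the classifier only ever sees `k` embeddings per query, so the aggregation step has the same size for a ten-chunk and a ten-thousand-chunk document. The chunks selected by each query are also what one would inspect to see which passages drove a prediction.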
