DocXclassifier: towards a robust and interpretable deep neural network for document image classification
Saifullah Saifullah, Stefan Agne, Andreas Dengel, Sheraz Ahmed
International Journal on Document Analysis and Recognition
DOI: 10.1007/s10032-024-00483-w | Published: 2024-06-25
Citations: 0
Abstract
Model interpretability and robustness are becoming increasingly critical for the safe and practical deployment of deep learning (DL) models in industrial settings. As DL-backed automated document processing systems become more common in business workflows, there is a pressing need to enhance interpretability and robustness for document image classification, an integral component of such systems. Surprisingly, while much research has been devoted to improving the performance of deep models on this task, little attention has been given to their interpretability and robustness. In this paper, we aim to improve both aspects and introduce two inherently interpretable deep document classifiers, DocXClassifier and DocXClassifierFPN, which not only achieve significant performance improvements over existing approaches but also generate feature importance maps alongside their predictions. Our approach integrates a convolutional neural network (ConvNet) backbone with an attention mechanism that performs weighted aggregation of features based on their importance to the class, enabling the generation of interpretable importance maps. We further propose combining Feature Pyramid Networks with the attention mechanism to significantly enhance the resolution of the interpretability maps, especially for pyramidal ConvNet architectures. Our approach attains state-of-the-art performance in image-based classification on two popular document datasets, RVL-CDIP and Tobacco3482, with top-1 classification accuracies of 94.19% and 95.71%, respectively, and sets a new record for the highest image-based classification accuracy on Tobacco3482 without transfer learning from RVL-CDIP, at 90.29%. Moreover, our proposed training strategy demonstrates superior robustness compared to existing approaches, significantly outperforming them on 19 out of 21 types of novel data distortions while achieving comparable results on the remaining two. By combining robustness with interpretability, DocXClassifier represents a promising step toward the practical deployment of DL models for document classification tasks.
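To make the attention-based weighted aggregation idea concrete, the following minimal PyTorch sketch shows one way such a mechanism could be wired on top of a ConvNet backbone so that the same attention weights that pool the features also serve as an importance map. This is an illustrative assumption, not the authors' DocXClassifier implementation: the class name AttentionPoolClassifier, the ResNet-50 backbone, and the single 1x1-convolution attention head are hypothetical choices made only for this sketch.

```python
# Minimal sketch of attention-weighted feature aggregation over a ConvNet
# feature map, in the spirit of the approach described in the abstract.
# All names here are hypothetical; this is not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision


class AttentionPoolClassifier(nn.Module):
    def __init__(self, num_classes: int = 16):
        super().__init__()
        # Any pyramidal ConvNet backbone would do; ResNet-50 is used here
        # only because it is readily available in torchvision.
        backbone = torchvision.models.resnet50(weights=None)
        self.features = nn.Sequential(*list(backbone.children())[:-2])  # B x 2048 x H x W
        self.attention = nn.Conv2d(2048, 1, kernel_size=1)              # per-location importance
        self.classifier = nn.Linear(2048, num_classes)

    def forward(self, x):
        feats = self.features(x)                            # B x C x H x W
        scores = self.attention(feats)                      # B x 1 x H x W
        weights = torch.softmax(scores.flatten(2), dim=-1).view_as(scores)
        pooled = (feats * weights).flatten(2).sum(dim=-1)   # weighted aggregation -> B x C
        logits = self.classifier(pooled)
        # The attention weights double as a coarse importance map that can be
        # upsampled to the input resolution for visualization.
        importance_map = F.interpolate(weights, size=x.shape[-2:],
                                       mode="bilinear", align_corners=False)
        return logits, importance_map


# Usage: a single 3-channel document image resized to 384x384.
model = AttentionPoolClassifier(num_classes=16)
logits, heatmap = model(torch.randn(1, 3, 384, 384))
print(logits.shape, heatmap.shape)  # torch.Size([1, 16]) torch.Size([1, 1, 384, 384])
```

Because a standard ConvNet downsamples aggressively, the map above is coarse; the FPN variant described in the abstract addresses exactly this by attending over higher-resolution pyramid levels.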
About the journal:
The large number of existing documents and the production of a multitude of new ones every year raise important issues in the efficient handling, retrieval and storage of these documents and the information they contain. This has led to the emergence of new research domains dealing with computer recognition of the constituent elements of documents, including characters, symbols, text, lines, graphics, images, handwriting, signatures, etc. In addition, these new domains deal with automatic analyses of the overall physical and logical structures of documents, with the ultimate objective of a high-level understanding of their semantic content. We have also seen renewed interest in optical character recognition (OCR) and handwriting recognition during the last decade. Document analysis and recognition are clearly the next stage.
Automatic, intelligent processing of documents sits at the intersection of many fields of research, especially computer vision, image analysis, pattern recognition and artificial intelligence, as well as studies on reading, handwriting and linguistics. Although quality document-related publications continue to appear in journals dedicated to these domains, the community will benefit from having this journal as a focal point for archival literature dedicated to document analysis and recognition.