DocScanner: Robust Document Image Rectification with Progressive Learning

IF 9.3 | Zone 2, Computer Science | JCR Q1, Computer Science, Artificial Intelligence

International Journal of Computer Vision | Pub Date: 2025-05-26 | DOI: 10.1007/s11263-025-02431-5

Hao Feng, Wengang Zhou, Jiajun Deng, Qi Tian, Houqiang Li


Citations: 0

Abstract

Compared with flatbed scanners, smartphones make digitizing physical documents more convenient. However, the resulting images are often distorted by uncontrolled physical deformation, camera position, and illumination variation. To address this, we present DocScanner, a novel framework for document image rectification. Unlike existing solutions, DocScanner introduces a progressive learning mechanism: it maintains a single estimate of the rectified image, which is progressively corrected with a recurrent architecture. The iterative refinements make DocScanner converge to robust and superior rectification performance, while the lightweight recurrent architecture keeps it efficient at run time. To further improve rectification quality, a geometric constraint based on the geometric prior between the distorted and rectified images is introduced during training. Extensive experiments are conducted on the Doc3D dataset and the DocUNet Benchmark dataset, and the quantitative and qualitative results verify the effectiveness of DocScanner, which outperforms previous methods on OCR accuracy, image similarity, and our proposed distortion metric by a considerable margin. Furthermore, DocScanner shows superior efficiency in runtime latency and model size. The code and pre-trained models are available at https://github.com/fh2019ustc/DocScanner.
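The control flow the abstract describes (one running estimate, corrected step by step by the same update unit) can be illustrated with a minimal NumPy sketch. This is not the authors' architecture. The displacement-field representation, the function names `toy_update`, `progressive_refine`, and `warp_nearest`, the step size, and the iteration count are all assumptions made for illustration. In particular, `toy_update` is an oracle stand-in that reads the target field directly, whereas a trained model would have to predict the residual from image features.

```python
import numpy as np


def toy_update(flow, target_flow, step=0.5):
    """Hypothetical stand-in for a learned update unit.

    Returns a residual that moves the current estimate part of the way
    toward the target. A trained model would predict this residual from
    image features and would never see target_flow.
    """
    return step * (target_flow - flow)


def progressive_refine(target_flow, num_iters=8):
    """Keep a single estimate and correct it with the same update each step."""
    flow = np.zeros_like(target_flow)  # the single running estimate
    errors = []
    for _ in range(num_iters):
        flow = flow + toy_update(flow, target_flow)  # accumulate the residual
        errors.append(float(np.abs(target_flow - flow).mean()))
    return flow, errors


def warp_nearest(image, flow):
    """Backward warp with nearest-neighbour sampling.

    output[y, x] = image[y + flow[y, x, 0], x + flow[y, x, 1]],
    with source coordinates rounded and clamped to the image bounds.
    """
    h, w = image.shape[:2]
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    src_y = np.clip(np.rint(ys + flow[..., 0]).astype(int), 0, h - 1)
    src_x = np.clip(np.rint(xs + flow[..., 1]).astype(int), 0, w - 1)
    return image[src_y, src_x]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    target = rng.uniform(-3.0, 3.0, size=(16, 16, 2))  # synthetic field
    _, errors = progressive_refine(target)
    print(errors)  # mean absolute error after each refinement step
```

With this toy update the mean error halves at every iteration, which shows only why repeated residual correction of one estimate can converge. It says nothing about the accuracy or speed of the published model.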


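Source journal

International Journal of Computer Vision (Engineering & Technology, Computer Science: Artificial Intelligence)

CiteScore: 29.80
Self-citation rate: 2.10%
Articles published: 163
Review time: 6 months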

Journal description: The International Journal of Computer Vision (IJCV) is a platform for sharing new research findings in the rapidly growing field of computer vision. It publishes 12 issues annually and presents high-quality, original contributions to the science and engineering of computer vision. Regular articles, of up to 25 journal pages, focus on significant technical advances of broad interest to the field. Short articles, limited to 10 pages, offer a faster publication path for novel research outcomes. Survey articles, of up to 30 pages, offer critical evaluations of the current state of the art or tutorial presentations of relevant topics. In addition to technical articles, the journal includes book reviews, position papers, and editorials by prominent scientific figures, which complement the technical content. The journal encourages authors to include supplementary material online, such as images, video sequences, data sets, and software, to improve the understanding and reproducibility of the published research.