International Journal on Document Analysis and Recognition: Latest Articles

Tabular context-aware optical character recognition and tabular data reconstruction for historical records

IF 2.5, Region 4 (Computer Science)

International Journal on Document Analysis and Recognition Pub Date: 2025-01-01 Epub Date: 2025-07-01 DOI: 10.1007/s10032-025-00543-9

Loitongbam Gyanendro Singh, Stuart E Middleton

Abstract: Digitizing historical tabular records is essential for preserving and analyzing valuable data across various fields, but it presents challenges due to complex layouts, mixed text types, and degraded document quality. This paper introduces a comprehensive framework to address these issues through three key contributions. First, it presents UoS_Data_Rescue, a novel dataset of 1,113 historical logbooks with over 594,000 annotated text cells, designed to handle the complexities of handwritten entries, aging artifacts, and intricate layouts. Second, it proposes a novel context-aware text extraction approach (TrOCR-ctx) to reduce cascading errors during table digitization. Third, it proposes an enhanced end-to-end OCR pipeline that integrates TrOCR-ctx with ByT5, combining OCR and post-OCR correction in a unified training framework. This framework enables the system to produce both the raw OCR output and a corrected version in a single pass, improving recognition accuracy, particularly for multilingual and degraded text, within complex table digitization tasks. The model achieves superior performance with a 0.049 word error rate and a 0.035 character error rate, outperforming existing methods by up to 41% in OCR tasks and 10.74% in table reconstruction tasks. This framework offers a robust solution for large-scale digitization of tabular documents, extending its applications beyond climate records to other domains requiring structured document preservation. The dataset and implementation are available as open-source resources.

Volume 28, Issue 3, pp. 357-376. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12450121/pdf/
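
The word error rate (WER) and character error rate (CER) quoted in this abstract are conventionally computed as an edit distance divided by the reference length. The following is a minimal sketch of that standard definition, not the authors' evaluation code; the function names `edit_distance`, `cer`, and `wer` are illustrative, and any text normalization the paper applies (casing, whitespace, tokenization) is not reproduced here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (unit cost per edit)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r from the reference
                            curr[j - 1] + 1,      # insert h into the reference
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits / reference characters."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word edits / reference words (whitespace tokens)."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```

For example, `cer("kitten", "sitting")` counts three edits against six reference characters, and `wer("the cat sat", "the cat sit")` counts one edit against three reference words.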

Citations: 0

A survey on artificial intelligence-based approaches for personality analysis from handwritten documents

IF 2.3, Region 4 (Computer Science)

International Journal on Document Analysis and Recognition Pub Date: 2024-08-27 DOI: 10.1007/s10032-024-00496-5

Suparna Saha Biswas, Himadri Mukherjee, Ankita Dhar, Obaidullah Sk Md, Kaushik Roy

Citations: 0

In-domain versus out-of-domain transfer learning for document layout analysis

IF 2.3, Region 4 (Computer Science)

International Journal on Document Analysis and Recognition Pub Date: 2024-08-19 DOI: 10.1007/s10032-024-00497-4

Axel De Nardin, Silvia Zottin, Claudio Piciarelli, Gian Luca Foresti, Emanuela Colombi

Abstract: Data availability is a big concern in the field of document analysis, especially when working on tasks that require a high degree of precision in the definition of the ground truths on which to train deep learning models. A notable example is the task of document layout analysis in handwritten documents, which requires pixel-precise segmentation maps to highlight the different layout components of each document page. These segmentation maps are typically very time-consuming to produce and require a high degree of domain knowledge to be defined, as they are intrinsically characterized by the content of the text. For this reason, in the present work we explore the effects of different initialization strategies for deep learning models employed for this type of task by relying on both in-domain and cross-domain datasets for their pre-training. To test the employed models we use two publicly available datasets with heterogeneous characteristics regarding both their structure and the languages of the contained documents. We show how a combination of cross-domain and in-domain transfer learning approaches leads to the best overall performance of the models and speeds up their convergence.

Citations: 0

Deep learning-based modified-EAST scene text detector: insights from a novel multiscript dataset

IF 2.3, Region 4 (Computer Science)

International Journal on Document Analysis and Recognition Pub Date: 2024-07-31 DOI: 10.1007/s10032-024-00491-w

Shilpa Mahajan, Rajneesh Rani, Aman Kamboj

Abstract: The field of computer vision has seen significant transformation with the emergence and advancement of deep learning models. Deep learning has had a significant impact on scene text detection, a vital and active area in computer vision. Numerous scientific, industrial, and academic procedures make use of text analysis. Natural scene text detection is more difficult than document image text detection owing to variations in font, size, style, brightness, etc. The National Institute of Technology Jalandhar-Text Detection dataset (NITJ-TD) is a new dataset that we have put forward in this study for various text analysis tasks including text detection, text segmentation, script identification, text recognition, etc. We also present a deep learning model that seeks to identify the location of text within images, which are gathered in an unrestricted setting. The system consists of an NMS to choose the best match and prevent repeated predictions, and a modified EAST to pinpoint the exact ROI in the image. To improve the model’s performance, an enhancement module is added to the fundamental Efficient and Accurate Scene Text detector (EAST). The suggested approach is compared in terms of text word detection in the image. Several pre-trained models are used to assign the text word to various Intersection over Union (IoU) values. We made use of our NITJ-TD dataset, which is made up of 1500 photos that were gathered from various North Indian sites. Punjabi, English, and Hindi scripts can be seen in the images. We also examined the outcomes on the ICDAR-2013 benchmark dataset. On both the suggested dataset and the benchmark dataset, our approach performed better.
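
The abstract refers to Intersection over Union (IoU) and to non-maximum suppression (NMS) for removing repeated predictions. The sketch below illustrates both in their simplest textbook form; it assumes axis-aligned boxes given as `(x1, y1, x2, y2)` and plain greedy NMS, whereas the abstract does not state which box parameterization or NMS variant the modified EAST uses. The names `iou` and `nms` and the default threshold of 0.5 are illustrative choices.

```python
def iou(a, b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes overlapping it.

    Returns the indices of the kept boxes, highest score first.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep box i only if it does not overlap any already-kept box too much.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep
```

With boxes `(0, 0, 10, 10)` and `(1, 1, 11, 11)`, the overlap is 81 / 119, which exceeds 0.5, so the lower-scoring of the two is suppressed while a distant box survives.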

Citations: 0

Towards fully automated processing and analysis of construction diagrams: AI-powered symbol detection

IF 2.3, Region 4 (Computer Science)

International Journal on Document Analysis and Recognition Pub Date: 2024-07-25 DOI: 10.1007/s10032-024-00492-9

Laura Jamieson, Carlos Francisco Moreno-Garcia, Eyad Elyan

Abstract: Construction drawings are frequently stored in undigitised formats and consequently, their analysis requires substantial manual effort. This is true for many crucial tasks, including material takeoff, where the purpose is to obtain a list of the equipment and respective amounts required for a project. Engineering drawing digitisation has recently attracted increased attention; however, construction drawings have received considerably less interest compared to other types. To address these issues, this paper presents a novel framework for the automatic processing of construction drawings. Extensive experiments were performed using two state-of-the-art deep learning models for object detection in challenging high-resolution drawings sourced from industry. The results show a significant reduction in the time required for drawing analysis. Promising performance was achieved for symbol detection across various classes, with a mean average precision of 79% for the YOLO-based method and 83% for the Faster R-CNN-based method. This framework enables the digital transformation of construction drawings, improving tasks such as material takeoff and many others.

Citations: 0

GAN-based text line segmentation method for challenging handwritten documents

IF 2.3, Region 4 (Computer Science)

International Journal on Document Analysis and Recognition Pub Date: 2024-07-21 DOI: 10.1007/s10032-024-00488-5

İbrahim Özşeker, Ali Alper Demir, Ufuk Özkaya

Citations: 0

Image quality determination of palm leaf heritage documents using integrated discrete cosine transform features with vision transformer

IF 2.3, Region 4 (Computer Science)

International Journal on Document Analysis and Recognition Pub Date: 2024-07-17 DOI: 10.1007/s10032-024-00490-x

Remya Sivan, Peeta Basa Pati, Made Windu Antara Kesiman

Abstract: Classification of palm leaf images into various quality categories is an important step towards the digitization of these heritage documents. Manual inspection and categorization is not only laborious, time-consuming and costly but also subject to the inspector’s biases and errors. This study aims to automate the classification of palm leaf document images into three different visual quality categories. A comparative analysis of various structural and statistical features and classifiers against deep neural networks is performed. VGG16, VGG19 and ResNet152v2 architectures along with a custom CNN model are used, while Discrete Cosine Transform (DCT), Grey Level Co-occurrence Matrix (GLCM), Tamura, and Histogram of Oriented Gradients (HOG) are chosen from the traditional methods. Based on these extracted features, various classifiers, namely k-Nearest Neighbors (k-NN), multi-layer perceptron (MLP), Support Vector Machines (SVM), Decision Tree (DT) and Logistic Regression (LR), are trained and evaluated. Accuracy, precision, recall, and F1 scores are used as performance metrics for the evaluation of the various algorithms. Results demonstrate that CNN embeddings and DCT features emerge as superior features. Based on these findings, we integrated DCT with a Vision Transformer (ViT) for the document classification task. The results show that this incorporation of DCT with ViT outperforms all other methods, with a training F1 score of 96% and a test F1 score of 90%.

Citations: 0

End-to-end semi-supervised approach with modulated object queries for table detection in documents

IF 2.3, Region 4 (Computer Science)

International Journal on Document Analysis and Recognition Pub Date: 2024-07-10 DOI: 10.1007/s10032-024-00471-0

Iqraa Ehsan, Tahira Shehzadi, Didier Stricker, Muhammad Zeshan Afzal

Abstract: Table detection, a pivotal task in document analysis, aims to precisely recognize and locate tables within document images. Although deep learning has shown remarkable progress in this realm, it typically requires an extensive dataset of labeled data for proficient training. Current CNN-based semi-supervised table detection approaches use the anchor generation process and non-maximum suppression in their detection process, limiting training efficiency. Meanwhile, transformer-based semi-supervised techniques adopt a one-to-one match strategy that provides noisy pseudo-labels, limiting overall efficiency. This study presents an innovative transformer-based semi-supervised table detector. It improves the quality of pseudo-labels through a novel matching strategy combining one-to-one and one-to-many assignment techniques. This approach significantly enhances training efficiency during the early stages, ensuring superior pseudo-labels for further training. Our semi-supervised approach is comprehensively evaluated on benchmark datasets, including PubLayNet, ICDAR-19, and TableBank. It achieves new state-of-the-art results, with a mAP of 95.7% and 97.9% on TableBank (word) and PubLayNet with 30% labeled data, improvements of 7.4 and 7.6 points, respectively, over the previous semi-supervised table detection approach. The results clearly show the superiority of our semi-supervised approach, surpassing all existing state-of-the-art methods by substantial margins. This research represents a significant advancement in semi-supervised table detection methods, offering a more efficient and accurate solution for practical document analysis tasks.

Citations: 0

ChemScraper: leveraging PDF graphics instructions for molecular diagram parsing

IF 2.3, Region 4 (Computer Science)

International Journal on Document Analysis and Recognition Pub Date: 2024-07-05 DOI: 10.1007/s10032-024-00486-7

Ayush Kumar Shah, Bryan Amador, Abhisek Dey, Ming Creekmore, Blake Ocampo, Scott Denmark, Richard Zanibbi

Abstract: Most molecular diagram parsers recover chemical structure from raster images (e.g., PNGs). However, many PDFs include commands giving explicit locations and shapes for characters, lines, and polygons. We present a new parser that uses these born-digital PDF primitives as input. The parsing model is fast and accurate, and does not require GPUs, Optical Character Recognition (OCR), or vectorization. We use the parser to annotate raster images and then train a new multi-task neural network for recognizing molecules in raster images. We evaluate our parsers using SMILES and standard benchmarks, along with a novel evaluation protocol comparing molecular graphs directly that supports automatic error compilation and reveals errors missed by SMILES-based evaluation. On the synthetic USPTO benchmark, our born-digital parser obtains a recognition rate of 98.4% (1% higher than previous models) and our relatively simple neural parser for raster images obtains a rate of 85% using less training data than existing neural approaches (thousands vs. millions of molecules).

Citations: 0

Exploring recursive neural networks for compact handwritten text recognition models

IF 2.3, Region 4 (Computer Science)

International Journal on Document Analysis and Recognition Pub Date: 2024-06-27 DOI: 10.1007/s10032-024-00481-y

Enrique Mas-Candela, Jorge Calvo-Zaragoza

Abstract: This paper addresses the challenge of deploying recognition models in specific scenarios in which memory size is relevant, such as in low-cost devices or browser-based applications. We specifically focus on developing memory-efficient approaches for Handwritten Text Recognition (HTR) by leveraging recursive networks. These networks reuse learned weights across successive layers, thus enabling the maintenance of depth, a critical factor associated with model accuracy, without an increase in memory footprint. We apply neural recursion techniques to models typically used in HTR that contain convolutional and recurrent layers. We additionally study the impact of kernel scaling, which allows the activations of these recursive layers to be modified for greater expressiveness with little cost to memory. Our experiments on various HTR benchmarks demonstrate that recursive networks are, indeed, a good alternative. It is noteworthy that these recursive networks not only preserve but in some instances also enhance accuracy, making them a promising solution for memory-efficient HTR applications. This research establishes the utility of recursive networks in addressing memory constraints in HTR models. Their ability to sustain or improve accuracy while being memory-efficient positions them as a promising solution for practical deployment, especially in contexts where memory size is a critical consideration, such as low-cost devices and browser-based applications.
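
The core idea in this abstract, reusing the same learned weights across successive layers so that depth grows without extra parameters, can be illustrated with a toy example. The sketch below uses a small dense layer with ReLU in pure Python purely for illustration; the paper applies recursion to convolutional and recurrent layers, and kernel scaling is not modelled here. The names `dense_relu`, `recursive_forward`, and `param_count` are hypothetical.

```python
def dense_relu(x, weights):
    """One dense layer with ReLU: y[j] = max(0, sum_i x[i] * weights[i][j])."""
    n_out = len(weights[0])
    return [max(0.0, sum(xi * row[j] for xi, row in zip(x, weights)))
            for j in range(n_out)]


def recursive_forward(x, weights, depth):
    """Apply the SAME weight matrix `depth` times (weights must be square)."""
    for _ in range(depth):
        x = dense_relu(x, weights)
    return x


def param_count(weights):
    """Number of stored parameters in one weight matrix."""
    return len(weights) * len(weights[0])


shared = [[0.5, 0.0],
          [0.0, 2.0]]
depth = 3
output = recursive_forward([1.0, 1.0], shared, depth)

# A conventional stack would store one matrix per layer; the recursive
# version stores a single matrix regardless of depth.
stacked_params = depth * param_count(shared)
recursive_params = param_count(shared)
```

In this toy setting the three-layer recursive network stores 4 parameters where an ordinary three-layer stack of the same width would store 12, while still performing three successive transformations.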

Citations: 0