{"title":"Sampling and Ranking for Digital Ink Generation on a tight computational budget","authors":"A. Afonin, Andrii Maksai, A. Timofeev, C. Musat","doi":"10.48550/arXiv.2306.03103","DOIUrl":"https://doi.org/10.48550/arXiv.2306.03103","url":null,"abstract":"Digital ink (online handwriting) generation has a number of potential applications for creating user-visible content, such as handwriting autocompletion, spelling correction, and beautification. Writing is personal and usually the processing is done on-device. Ink generative models thus need to produce high quality content quickly, in a resource constrained environment. In this work, we study ways to maximize the quality of the output of a trained digital ink generative model, while staying within an inference time budget. We use and compare the effect of multiple sampling and ranking techniques, in the first ablation study of its kind in the digital ink domain. We confirm our findings on multiple datasets - writing in English and Vietnamese, as well as mathematical formulas - using two model types and two common ink data representations. In all combinations, we report a meaningful improvement in the recognizability of the synthetic inks, in some cases more than halving the character error rate metric, and describe a way to select the optimal combination of sampling and ranking techniques for any given computational budget.","PeriodicalId":294655,"journal":{"name":"IEEE International Conference on Document Analysis and Recognition","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125427968","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Zero-shot Generation of Training Data with Denoising Diffusion Probabilistic Model for Handwritten Chinese Character Recognition","authors":"Dongnan Gui, Kai Chen, Haisong Ding, Qiang Huo","doi":"10.48550/arXiv.2305.15660","DOIUrl":"https://doi.org/10.48550/arXiv.2305.15660","url":null,"abstract":"There are more than 80,000 character categories in Chinese while most of them are rarely used. To build a high performance handwritten Chinese character recognition (HCCR) system supporting the full character set with a traditional approach, many training samples need be collected for each character category, which is both time-consuming and expensive. In this paper, we propose a novel approach to transforming Chinese character glyph images generated from font libraries to handwritten ones with a denoising diffusion probabilistic model (DDPM). Training from handwritten samples of a small character set, the DDPM is capable of mapping printed strokes to handwritten ones, which makes it possible to generate photo-realistic and diverse style handwritten samples of unseen character categories. Combining DDPM-synthesized samples of unseen categories with real samples of other categories, we can build an HCCR system to support the full character set. Experimental results on CASIA-HWDB dataset with 3,755 character categories show that the HCCR systems trained with synthetic samples perform similarly with the one trained with real samples in terms of recognition accuracy. The proposed method has the potential to address HCCR with a larger vocabulary.","PeriodicalId":294655,"journal":{"name":"IEEE International Conference on Document Analysis and Recognition","volume":"71 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125976096","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Visual Information Extraction in the Wild: Practical Dataset and End-to-end Solution","authors":"Jianfeng Kuang, W. Hua, Dingkang Liang, Mingkun Yang, Deqiang Jiang, Bo Ren, Yu Zhou, Xiang Bai","doi":"10.48550/arXiv.2305.07498","DOIUrl":"https://doi.org/10.48550/arXiv.2305.07498","url":null,"abstract":"Visual information extraction (VIE), which aims to simultaneously perform OCR and information extraction in a unified framework, has drawn increasing attention due to its essential role in various applications like understanding receipts, goods, and traffic signs. However, as existing benchmark datasets for VIE mainly consist of document images without the adequate diversity of layout structures, background disturbs, and entity categories, they cannot fully reveal the challenges of real-world applications. In this paper, we propose a large-scale dataset consisting of camera images for VIE, which contains not only the larger variance of layout, backgrounds, and fonts but also much more types of entities. Besides, we propose a novel framework for end-to-end VIE that combines the stages of OCR and information extraction in an end-to-end learning fashion. Different from the previous end-to-end approaches that directly adopt OCR features as the input of an information extraction module, we propose to use contrastive learning to narrow the semantic gap caused by the difference between the tasks of OCR and information extraction. We evaluate the existing end-to-end methods for VIE on the proposed dataset and observe that the performance of these methods has a distinguishable drop from SROIE (a widely used English dataset) to our proposed dataset due to the larger variance of layout and entities. These results demonstrate our dataset is more practical for promoting advanced VIE algorithms. In addition, experiments demonstrate that the proposed VIE method consistently achieves the obvious performance gains on the proposed and SROIE datasets.","PeriodicalId":294655,"journal":{"name":"IEEE International Conference on Document Analysis and Recognition","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128787295","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards Writer Retrieval for Historical Datasets","authors":"Marco Peer, Florian Kleber, Robert Sablatnig","doi":"10.1007/978-3-031-41676-7_24","DOIUrl":"https://doi.org/10.1007/978-3-031-41676-7_24","url":null,"abstract":"","PeriodicalId":294655,"journal":{"name":"IEEE International Conference on Document Analysis and Recognition","volume":"54 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116114596","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Language Independent Neuro-Symbolic Semantic Parsing for Form Understanding","authors":"Bhanu Prakash Voutharoja, Lizhen Qu, Fatemeh Shiri","doi":"10.48550/arXiv.2305.04460","DOIUrl":"https://doi.org/10.48550/arXiv.2305.04460","url":null,"abstract":"Recent works on form understanding mostly employ multimodal transformers or large-scale pre-trained language models. These models need ample data for pre-training. In contrast, humans can usually identify key-value pairings from a form only by looking at layouts, even if they don't comprehend the language used. No prior research has been conducted to investigate how helpful layout information alone is for form understanding. Hence, we propose a unique entity-relation graph parsing method for scanned forms called LAGNN, a language-independent Graph Neural Network model. Our model parses a form into a word-relation graph in order to identify entities and relations jointly and reduce the time complexity of inference. This graph is then transformed by deterministic rules into a fully connected entity-relation graph. Our model simply takes into account relative spacing between bounding boxes from layout information to facilitate easy transfer across languages. To further improve the performance of LAGNN, and achieve isomorphism between entity-relation graphs and word-relation graphs, we use integer linear programming (ILP) based inference. Code is publicly available at https://github.com/Bhanu068/LAGNN","PeriodicalId":294655,"journal":{"name":"IEEE International Conference on Document Analysis and Recognition","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123868161","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scene Text Recognition with Image-Text Matching-guided Dictionary","authors":"Jiajun Wei, Hongjian Zhan, X. Tu, Yue Lu, U. Pal","doi":"10.48550/arXiv.2305.04524","DOIUrl":"https://doi.org/10.48550/arXiv.2305.04524","url":null,"abstract":"Employing a dictionary can efficiently rectify the deviation between the visual prediction and the ground truth in scene text recognition methods. However, the independence of the dictionary on the visual features may lead to incorrect rectification of accurate visual predictions. In this paper, we propose a new dictionary language model leveraging the Scene Image-Text Matching(SITM) network, which avoids the drawbacks of the explicit dictionary language model: 1) the independence of the visual features; 2) noisy choice in candidates etc. The SITM network accomplishes this by using Image-Text Contrastive (ITC) Learning to match an image with its corresponding text among candidates in the inference stage. ITC is widely used in vision-language learning to pull the positive image-text pair closer in feature space. Inspired by ITC, the SITM network combines the visual features and the text features of all candidates to identify the candidate with the minimum distance in the feature space. Our lexicon method achieves better results(93.8% accuracy) than the ordinary method results(92.1% accuracy) on six mainstream benchmarks. Additionally, we integrate our method with ABINet and establish new state-of-the-art results on several benchmarks.","PeriodicalId":294655,"journal":{"name":"IEEE International Conference on Document Analysis and Recognition","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116238315","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SwinDocSegmenter: An End-to-End Unified Domain Adaptive Transformer for Document Instance Segmentation","authors":"Ayan Banerjee, Sanket Biswas, Josep Llad'os, U. Pal","doi":"10.48550/arXiv.2305.04609","DOIUrl":"https://doi.org/10.48550/arXiv.2305.04609","url":null,"abstract":"Instance-level segmentation of documents consists in assigning a class-aware and instance-aware label to each pixel of the image. It is a key step in document parsing for their understanding. In this paper, we present a unified transformer encoder-decoder architecture for en-to-end instance segmentation of complex layouts in document images. The method adapts a contrastive training with a mixed query selection for anchor initialization in the decoder. Later on, it performs a dot product between the obtained query embeddings and the pixel embedding map (coming from the encoder) for semantic reasoning. Extensive experimentation on competitive benchmarks like PubLayNet, PRIMA, Historical Japanese (HJ), and TableBank demonstrate that our model with SwinL backbone achieves better segmentation performance than the existing state-of-the-art approaches with the average precision of textbf{93.72}, textbf{54.39}, textbf{84.65} and textbf{98.04} respectively under one billion parameters. The code is made publicly available at: href{https://github.com/ayanban011/SwinDocSegmenter}{github.com/ayanban011/SwinDocSegmenter}","PeriodicalId":294655,"journal":{"name":"IEEE International Conference on Document Analysis and Recognition","volume":"87 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133570553","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimized Table Tokenization for Table Structure Recognition","authors":"Maksym Lysak, Ahmed Nassar, Nikolaos Livathinos, Christoph Auer, P. Staar","doi":"10.48550/arXiv.2305.03393","DOIUrl":"https://doi.org/10.48550/arXiv.2305.03393","url":null,"abstract":"Extracting tables from documents is a crucial task in any document conversion pipeline. Recently, transformer-based models have demonstrated that table-structure can be recognized with impressive accuracy using Image-to-Markup-Sequence (Im2Seq) approaches. Taking only the image of a table, such models predict a sequence of tokens (e.g. in HTML, LaTeX) which represent the structure of the table. Since the token representation of the table structure has a significant impact on the accuracy and run-time performance of any Im2Seq model, we investigate in this paper how table-structure representation can be optimised. We propose a new, optimised table-structure language (OTSL) with a minimized vocabulary and specific rules. The benefits of OTSL are that it reduces the number of tokens to 5 (HTML needs 28+) and shortens the sequence length to half of HTML on average. Consequently, model accuracy improves significantly, inference time is halved compared to HTML-based models, and the predicted table structures are always syntactically correct. This in turn eliminates most post-processing needs.","PeriodicalId":294655,"journal":{"name":"IEEE International Conference on Document Analysis and Recognition","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117312802","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"How to Choose Pretrained Handwriting Recognition Models for Single Writer Fine-Tuning","authors":"Vittorio Pippi, S. Cascianelli, Christopher Kermorvant, R. Cucchiara","doi":"10.48550/arXiv.2305.02593","DOIUrl":"https://doi.org/10.48550/arXiv.2305.02593","url":null,"abstract":"Recent advancements in Deep Learning-based Handwritten Text Recognition (HTR) have led to models with remarkable performance on both modern and historical manuscripts in large benchmark datasets. Nonetheless, those models struggle to obtain the same performance when applied to manuscripts with peculiar characteristics, such as language, paper support, ink, and author handwriting. This issue is very relevant for valuable but small collections of documents preserved in historical archives, for which obtaining sufficient annotated training data is costly or, in some cases, unfeasible. To overcome this challenge, a possible solution is to pretrain HTR models on large datasets and then fine-tune them on small single-author collections. In this paper, we take into account large, real benchmark datasets and synthetic ones obtained with a styled Handwritten Text Generation model. Through extensive experimental analysis, also considering the amount of fine-tuning lines, we give a quantitative indication of the most relevant characteristics of such data for obtaining an HTR model able to effectively transcribe manuscripts in small collections with as little as five real fine-tuning lines.","PeriodicalId":294655,"journal":{"name":"IEEE International Conference on Document Analysis and Recognition","volume":"116 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117258577","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards End-to-End Semi-Supervised Table Detection with Deformable Transformer","authors":"Tahira Shehzadi, K. Hashmi, D. Stricker, M. Liwicki, Muhammad Zeshan Afzal","doi":"10.48550/arXiv.2305.02769","DOIUrl":"https://doi.org/10.48550/arXiv.2305.02769","url":null,"abstract":"Table detection is the task of classifying and localizing table objects within document images. With the recent development in deep learning methods, we observe remarkable success in table detection. However, a significant amount of labeled data is required to train these models effectively. Many semi-supervised approaches are introduced to mitigate the need for a substantial amount of label data. These approaches use CNN-based detectors that rely on anchor proposals and post-processing stages such as NMS. To tackle these limitations, this paper presents a novel end-to-end semi-supervised table detection method that employs the deformable transformer for detecting table objects. We evaluate our semi-supervised method on PubLayNet, DocBank, ICADR-19 and TableBank datasets, and it achieves superior performance compared to previous methods. It outperforms the fully supervised method (Deformable transformer) by +3.4 points on 10% labels of TableBank-both dataset and the previous CNN-based semi-supervised approach (Soft Teacher) by +1.8 points on 10% labels of PubLayNet dataset. We hope this work opens new possibilities towards semi-supervised and unsupervised table detection methods.","PeriodicalId":294655,"journal":{"name":"IEEE International Conference on Document Analysis and Recognition","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114862371","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}