IEEE International Conference on Document Analysis and Recognition: Latest Publications

Semantic Graph Representation Learning for Handwritten Mathematical Expression Recognition

IEEE International Conference on Document Analysis and Recognition Pub Date : 2023-08-21 DOI: 10.1007/978-3-031-41676-7_9

Zhuang Liu, Ye Yuan, Zhilong Ji, Jingfeng Bai, X. Bai

Citations: 0

I-WAS: a Data Augmentation Method with GPT-2 for Simile Detection

IEEE International Conference on Document Analysis and Recognition Pub Date : 2023-08-08 DOI: 10.48550/arXiv.2308.04109

Yongzhu Chang, Rongsheng Zhang, Jiashu Pu

Citations: 0

A Graphical Approach to Document Layout Analysis

IEEE International Conference on Document Analysis and Recognition Pub Date : 2023-08-03 DOI: 10.48550/arXiv.2308.02051

Jilin Wang, Michael Krumdick, Baojia Tong, Hamima Halim, M. Sokolov, Vadym Barda, Delphine Vendryes, Christy Tanner

Abstract: Document layout analysis (DLA) is the task of detecting the distinct, semantic content within a document and correctly classifying these items into an appropriate category (e.g., text, title, figure). DLA pipelines enable users to convert documents into structured machine-readable formats that can then be used for many useful downstream tasks. Most existing state-of-the-art (SOTA) DLA models represent documents as images, discarding the rich metadata available in electronically generated PDFs. Directly leveraging this metadata, we represent each PDF page as a structured graph and frame the DLA problem as a graph segmentation and classification problem. We introduce the Graph-based Layout Analysis Model (GLAM), a lightweight graph neural network competitive with SOTA models on two challenging DLA datasets, while being an order of magnitude smaller than existing models. In particular, the 4-million parameter GLAM model outperforms the leading 140M+ parameter computer vision-based model on 5 of the 11 classes on the DocLayNet dataset. A simple ensemble of these two models achieves a new state-of-the-art on DocLayNet, increasing mAP from 76.8 to 80.8. Overall, GLAM is over 5 times more efficient than SOTA models, making GLAM a favorable engineering choice for DLA tasks.
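The idea of representing a PDF page as a graph can be illustrated with a small sketch. The abstract does not say how GLAM builds its edges, so the nearest-neighbour rule below, along with the `Box` type and the `build_page_graph` helper, is a hypothetical assumption made only for illustration and is not the authors' construction.

```python
from dataclasses import dataclass
from math import hypot


@dataclass(frozen=True)
class Box:
    """A text box as it might be read from PDF metadata (hypothetical type)."""
    x0: float
    y0: float
    x1: float
    y1: float
    text: str

    @property
    def center(self):
        return ((self.x0 + self.x1) / 2, (self.y0 + self.y1) / 2)


def build_page_graph(boxes, k=2):
    """Connect each box to its k nearest neighbours by centre distance.

    Returns a set of undirected edges as (i, j) index pairs with i < j.
    The k-nearest-neighbour rule is an assumption, not the paper's method.
    """
    edges = set()
    for i, a in enumerate(boxes):
        ax, ay = a.center
        distances = sorted(
            (hypot(ax - b.center[0], ay - b.center[1]), j)
            for j, b in enumerate(boxes)
            if j != i
        )
        for _, j in distances[:k]:
            edges.add((min(i, j), max(i, j)))
    return edges


# Toy page: three stacked text boxes and one distant caption.
page = [
    Box(50, 700, 300, 720, "Title"),
    Box(50, 650, 300, 690, "First paragraph"),
    Box(50, 600, 300, 640, "Second paragraph"),
    Box(350, 100, 550, 300, "Figure caption"),
]
graph_edges = build_page_graph(page, k=1)
```

With nodes and edges in this form, layout analysis becomes a matter of classifying nodes and grouping connected nodes into segments, which is the framing the abstract describes.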

Citations: 0

RealCQA: Scientific Chart Question Answering as a Test-bed for First-Order Logic

IEEE International Conference on Document Analysis and Recognition Pub Date : 2023-08-03 DOI: 10.48550/arXiv.2308.01979

Saleem Ahmed, Bhavin Jawade, Shubham Pandey, S. Setlur, Venugopal Govindaraju

Abstract: We present a comprehensive study of the chart visual question-answering (QA) task, to address the challenges faced in comprehending and extracting data from chart visualizations within documents. Despite efforts to tackle this problem using synthetic charts, solutions are limited by the shortage of annotated real-world data. To fill this gap, we introduce a benchmark and dataset for chart visual QA on real-world charts, offering a systematic analysis of the task and a novel taxonomy for template-based chart question creation. Our contribution includes the introduction of a new answer type, 'list', with both ranked and unranked variations. Our study is conducted on a real-world chart dataset from scientific literature, showcasing higher visual complexity compared to other works. Our focus is on template-based QA and how it can serve as a standard for evaluating the first-order logic capabilities of models. The results of our experiments, conducted on a real-world out-of-distribution dataset, provide a robust evaluation of large-scale pre-trained models and advance the field of chart visual QA and formal logic verification for neural networks in general.
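The difference between ranked and unranked 'list' answers can be sketched as follows. The abstract does not give the benchmark's scoring rule, so the exact-match comparison and the `list_answer_matches` function are assumptions used only to show why the two variations need different checks.

```python
def normalise(items):
    """Lower-case and trim each item so that formatting does not affect matching."""
    return [str(item).strip().lower() for item in items]


def list_answer_matches(predicted, expected, ranked):
    """Return True when a predicted list answer matches the expected one.

    ranked=True:  order matters (e.g. "list the series from highest to lowest").
    ranked=False: only membership matters (e.g. "which series exceed 10?").
    Exact matching is an assumed, simplified scoring rule.
    """
    pred, gold = normalise(predicted), normalise(expected)
    if ranked:
        return pred == gold
    return sorted(pred) == sorted(gold)


unranked_ok = list_answer_matches(["B", "A"], ["a", "b"], ranked=False)
ranked_ok = list_answer_matches(["B", "A"], ["a", "b"], ranked=True)
```

The same prediction is accepted as an unranked answer and rejected as a ranked one, because only the ranked check is sensitive to order.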

Citations: 0

SpaDen: Sparse and Dense Keypoint Estimation for Real-World Chart Understanding

IEEE International Conference on Document Analysis and Recognition Pub Date : 2023-08-03 DOI: 10.48550/arXiv.2308.01971

Saleem Ahmed, Pengyu Yan, D. Doermann, S. Setlur, Venugopal Govindaraju

Abstract: We introduce a novel bottom-up approach for the extraction of chart data. Our model utilizes images of charts as inputs and learns to detect keypoints (KP), which are used to reconstruct the components within the plot area. Our novelty lies in detecting a fusion of continuous and discrete KP as predicted heatmaps. A combination of sparse and dense per-pixel objectives coupled with a uni-modal self-attention-based feature-fusion layer is applied to learn KP embeddings. Further leveraging deep metric learning for unsupervised clustering allows us to segment the chart plot area into various objects. By further matching the chart components to the legend, we are able to obtain the data series names. A post-processing threshold is applied to the KP embeddings to refine the object reconstructions and improve accuracy. Our extensive experiments include an evaluation of different modules for KP estimation and the combination of deep layer aggregation and corner pooling approaches. The results of our experiments provide extensive evaluation for the task of real-world chart data extraction.
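Reading keypoints off a predicted heatmap is commonly done by thresholding and keeping local maxima. The sketch below shows that generic step in plain Python; the `heatmap_peaks` function, the 8-neighbour rule, and the 0.5 threshold are assumptions for illustration and are not taken from the paper.

```python
def heatmap_peaks(heatmap, threshold=0.5):
    """Return (row, col) positions whose value is at or above `threshold`
    and strictly greater than all of their (up to 8) neighbours."""
    rows, cols = len(heatmap), len(heatmap[0])
    peaks = []
    for r in range(rows):
        for c in range(cols):
            value = heatmap[r][c]
            if value < threshold:
                continue
            neighbours = [
                heatmap[rr][cc]
                for rr in range(max(r - 1, 0), min(r + 2, rows))
                for cc in range(max(c - 1, 0), min(c + 2, cols))
                if (rr, cc) != (r, c)
            ]
            if all(value > n for n in neighbours):
                peaks.append((r, c))
    return peaks


# Toy 4x4 heatmap with one clear peak and one pair of competing responses.
toy_heatmap = [
    [0.0, 0.1, 0.0, 0.0],
    [0.1, 0.9, 0.2, 0.0],
    [0.0, 0.2, 0.1, 0.6],
    [0.0, 0.0, 0.3, 0.7],
]
keypoints = heatmap_peaks(toy_heatmap)
```

In the toy heatmap the 0.6 response is suppressed because its neighbour at 0.7 is stronger, so only two keypoints remain.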

Citations: 0

Reading Between the Lanes: Text VideoQA on the Road

IEEE International Conference on Document Analysis and Recognition Pub Date : 2023-07-08 DOI: 10.48550/arXiv.2307.03948

George Tom, Minesh Mathew, Sergi Garcia, Dimosthenis Karatzas, C.V. Jawahar

Abstract: Text and signs around roads provide crucial information for drivers, vital for safe navigation and situational awareness. Scene text recognition in motion is a challenging problem, as textual cues typically appear for a short time span and early detection at a distance is necessary. Systems that exploit such information to assist the driver should not only extract and incorporate visual and textual cues from the video stream but also reason over time. To address this issue, we introduce RoadTextVQA, a new dataset for the task of video question answering (VideoQA) in the context of driver assistance. RoadTextVQA consists of 3,222 driving videos collected from multiple countries, annotated with 10,500 questions, all based on text or road signs present in the driving videos. We assess the performance of state-of-the-art video question answering models on our RoadTextVQA dataset, highlighting the significant potential for improvement in this domain and the usefulness of the dataset in advancing research on in-vehicle support systems and text-aware multimodal question answering. The dataset is available at http://cvit.iiit.ac.in/research/projects/cvit-projects/roadtextvqa

Citations: 1

Line Graphics Digitization: A Step Towards Full Automation

IEEE International Conference on Document Analysis and Recognition Pub Date : 2023-07-05 DOI: 10.48550/arXiv.2307.02065

Omar Moured, Jiaming Zhang, Alina Roitberg, Thorsten Schwarz, R. Stiefelhagen

Citations: 0

UTRNet: High-Resolution Urdu Text Recognition In Printed Documents

IEEE International Conference on Document Analysis and Recognition Pub Date : 2023-06-27 DOI: 10.1007/978-3-031-41734-4_19

Abdur Rahman, Arjun Ghosh, Chetan Arora

Citations: 0

Ambigram Generation by A Diffusion Model

IEEE International Conference on Document Analysis and Recognition Pub Date : 2023-06-21 DOI: 10.48550/arXiv.2306.12049

T. Shirakawa, Seiichi Uchida

Abstract: Ambigrams are graphical letter designs that can be read not only from the original direction but also from a rotated direction (especially when rotated by 180 degrees). Designing ambigrams is difficult even for human experts because keeping their dual readability from both directions is often difficult. This paper proposes an ambigram generation model. As its generation module, we use a diffusion model, which has recently been used to generate high-quality photographic images. By specifying a pair of letter classes, such as 'A' and 'B', the proposed model generates various ambigram images which can be read as 'A' from the original direction and 'B' from a direction rotated 180 degrees. Quantitative and qualitative analyses of experimental results show that the proposed model can generate high-quality and diverse ambigrams. In addition, we define ambigramability, an objective measure of how easy it is to generate ambigrams for each letter pair. For example, the pair of 'A' and 'V' shows a high ambigramability (that is, it is easy to generate their ambigrams), and the pair of 'D' and 'K' shows a lower ambigramability. The ambigramability gives various hints of the ambigram generation not only for computers but also for human experts. The code can be found at https://github.com/univ-esuty/ambifusion.
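The 180-degree condition can be made concrete with a toy check on small bitmaps. This is only an illustration of the rotation relation; the `rotate_180` and `pixel_agreement` helpers and the 3x3 glyphs are hypothetical, and the pixel-agreement score is not the ambigramability measure defined in the paper.

```python
def rotate_180(bitmap):
    """Rotate a 2D bitmap (list of equal-length rows) by 180 degrees."""
    return [row[::-1] for row in bitmap[::-1]]


def pixel_agreement(a, b):
    """Fraction of pixels on which two same-sized bitmaps agree."""
    total = sum(len(row) for row in a)
    same = sum(pa == pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return same / total


# Toy 3x3 glyphs (1 = ink); crude stand-ins for the letters 'A' and 'V'.
glyph_a = [
    [0, 1, 0],
    [1, 1, 1],
    [1, 0, 1],
]
glyph_v = [
    [1, 0, 1],
    [1, 1, 1],
    [0, 1, 0],
]

agreement = pixel_agreement(rotate_180(glyph_a), glyph_v)
```

Here the rotated 'A' coincides with the 'V' bitmap, which gives an intuition for why a pair such as 'A' and 'V' is reported as easy to turn into an ambigram.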

Citations: 1

ICDAR 2023 Competition on Structured Text Extraction from Visually-Rich Document Images

IEEE International Conference on Document Analysis and Recognition Pub Date : 2023-06-05 DOI: 10.48550/arXiv.2306.03287

Wenwen Yu, Chengquan Zhang, H. Cao, W. Hua, Bohan Li, Huang-wei Chen, Ming Liu, Mingrui Chen, Jianfeng Kuang, Mengjun Cheng, Yuning Du, Shikun Feng, Xiaoguang Hu, Pengyuan Lyu, Kun Yao, Yu Yu, Yuliang Liu, Wanxiang Che, Errui Ding, Chengxi Liu, Jiebo Luo, Shuicheng Yan, M. Zhang, Dimosthenis Karatzas, Xingchao Sun, Jingdong Wang, Xiang Bai

Abstract: Structured text extraction is one of the most valuable and challenging application directions in the field of Document AI. However, the scenarios of past benchmarks are limited, and the corresponding evaluation protocols usually focus on the submodules of the structured text extraction scheme. In order to eliminate these problems, we organized the ICDAR 2023 competition on Structured text extraction from Visually-Rich Document images (SVRD). We set up two tracks for SVRD, Track 1: HUST-CELL and Track 2: Baidu-FEST, where HUST-CELL aims to evaluate the end-to-end performance of Complex Entity Linking and Labeling, and Baidu-FEST focuses on evaluating the performance and generalization of Zero-shot / Few-shot Structured Text extraction from an end-to-end perspective. Compared to current document benchmarks, the benchmark for our two competition tracks greatly enriches the scenarios and contains more than 50 types of visually-rich document images (mainly from actual enterprise applications). The competition opened on 30 December 2022 and closed on 24 March 2023. Track 1 received 91 valid submissions from 35 participants, and Track 2 received 26 valid submissions from 15 participants. In this report we present the motivation, competition datasets, task definition, evaluation protocol, and submission summaries. According to the performance of the submissions, we believe there is still a large gap relative to the expected information extraction performance for complex and zero-shot scenarios. It is hoped that this competition will attract many researchers in the fields of CV and NLP, and bring some new thoughts to the field of Document AI.

Citations: 1