Optimized Table Tokenization for Table Structure Recognition

IEEE International Conference on Document Analysis and Recognition. Pub Date: 2023-05-05. DOI: 10.48550/arXiv.2305.03393

Maksym Lysak, Ahmed Nassar, Nikolaos Livathinos, Christoph Auer, P. Staar



Abstract

Extracting tables from documents is a crucial task in any document conversion pipeline. Recently, transformer-based models have demonstrated that table structure can be recognized with impressive accuracy using Image-to-Markup-Sequence (Im2Seq) approaches. Taking only the image of a table, such models predict a sequence of tokens (e.g., in HTML or LaTeX) that represent the structure of the table. Since the token representation of the table structure has a significant impact on the accuracy and run-time performance of any Im2Seq model, we investigate in this paper how table-structure representation can be optimized. We propose a new, optimized table-structure language (OTSL) with a minimized vocabulary and specific rules. The benefits of OTSL are that it reduces the number of tokens to 5 (HTML needs 28+) and shortens the sequence length to half of HTML on average. Consequently, model accuracy improves significantly, inference time is halved compared to HTML-based models, and the predicted table structures are always syntactically correct. This in turn eliminates most post-processing needs.
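
To make the idea of a small, grid-shaped token vocabulary concrete, here is a minimal Python sketch. The abstract does not list the five tokens or the rules of OTSL, so the token names ("C", "L", "U", "X", "NL") and their meanings below are assumptions for illustration only, and the function names to_grid and to_html_tokens are hypothetical. The sketch converts a flat grid-token sequence into HTML structure tags and is not the authors' implementation.

```python
# Assumed five-token vocabulary (NOT given in the abstract):
#   C  = a new cell
#   L  = merge with the cell to the left (extends a column span)
#   U  = merge with the cell above (extends a row span)
#   X  = merge with both left and upper neighbours (interior of a 2D span)
#   NL = end of the current row
CELL, LEFT, UP, CROSS, NEWLINE = "C", "L", "U", "X", "NL"


def to_grid(tokens):
    """Split a flat token sequence into rows at each NEWLINE token."""
    grid, row = [], []
    for tok in tokens:
        if tok == NEWLINE:
            grid.append(row)
            row = []
        else:
            row.append(tok)
    if row:  # tolerate a missing trailing NEWLINE
        grid.append(row)
    return grid


def to_html_tokens(tokens):
    """Convert grid tokens to HTML structure tags, deriving spans from L/U runs."""
    grid = to_grid(tokens)
    html = []
    for r, row in enumerate(grid):
        html.append("<tr>")
        for c, tok in enumerate(row):
            if tok != CELL:
                continue  # merge tokens are covered by an earlier cell
            colspan = 1
            while c + colspan < len(row) and row[c + colspan] == LEFT:
                colspan += 1
            rowspan = 1
            while (r + rowspan < len(grid)
                   and c < len(grid[r + rowspan])
                   and grid[r + rowspan][c] == UP):
                rowspan += 1
            attrs = ""
            if colspan > 1:
                attrs += f' colspan="{colspan}"'
            if rowspan > 1:
                attrs += f' rowspan="{rowspan}"'
            html.append(f"<td{attrs}></td>")
        html.append("</tr>")
    return html


# A 2x3 table: the first cell spans two columns, the last column spans two rows.
example = ["C", "L", "C", "NL", "C", "C", "U", "NL"]
html_tokens = to_html_tokens(example)
```

Because every row of the grid has the same number of tokens, a rectangular layout can be checked with a simple length comparison, which hints at why a grid-style language makes syntactically invalid output easier to rule out than free-form markup.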

