表结构识别的优化表标记化

IEEE International Conference on Document Analysis and Recognition Pub Date : 2023-05-05 DOI:10.48550/arXiv.2305.03393

Maksym Lysak, Ahmed Nassar, Nikolaos Livathinos, Christoph Auer, P. Staar

{"title":"表结构识别的优化表标记化","authors":"Maksym Lysak, Ahmed Nassar, Nikolaos Livathinos, Christoph Auer, P. Staar","doi":"10.48550/arXiv.2305.03393","DOIUrl":null,"url":null,"abstract":"Extracting tables from documents is a crucial task in any document conversion pipeline. Recently, transformer-based models have demonstrated that table-structure can be recognized with impressive accuracy using Image-to-Markup-Sequence (Im2Seq) approaches. Taking only the image of a table, such models predict a sequence of tokens (e.g. in HTML, LaTeX) which represent the structure of the table. Since the token representation of the table structure has a significant impact on the accuracy and run-time performance of any Im2Seq model, we investigate in this paper how table-structure representation can be optimised. We propose a new, optimised table-structure language (OTSL) with a minimized vocabulary and specific rules. The benefits of OTSL are that it reduces the number of tokens to 5 (HTML needs 28+) and shortens the sequence length to half of HTML on average. Consequently, model accuracy improves significantly, inference time is halved compared to HTML-based models, and the predicted table structures are always syntactically correct. This in turn eliminates most post-processing needs.","PeriodicalId":294655,"journal":{"name":"IEEE International Conference on Document Analysis and Recognition","volume":"37 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-05-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"Optimized Table Tokenization for Table Structure Recognition\",\"authors\":\"Maksym Lysak, Ahmed Nassar, Nikolaos Livathinos, Christoph Auer, P. Staar\",\"doi\":\"10.48550/arXiv.2305.03393\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Extracting tables from documents is a crucial task in any document conversion pipeline. Recently, transformer-based models have demonstrated that table-structure can be recognized with impressive accuracy using Image-to-Markup-Sequence (Im2Seq) approaches. Taking only the image of a table, such models predict a sequence of tokens (e.g. in HTML, LaTeX) which represent the structure of the table. Since the token representation of the table structure has a significant impact on the accuracy and run-time performance of any Im2Seq model, we investigate in this paper how table-structure representation can be optimised. We propose a new, optimised table-structure language (OTSL) with a minimized vocabulary and specific rules. The benefits of OTSL are that it reduces the number of tokens to 5 (HTML needs 28+) and shortens the sequence length to half of HTML on average. Consequently, model accuracy improves significantly, inference time is halved compared to HTML-based models, and the predicted table structures are always syntactically correct. This in turn eliminates most post-processing needs.\",\"PeriodicalId\":294655,\"journal\":{\"name\":\"IEEE International Conference on Document Analysis and Recognition\",\"volume\":\"37 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-05-05\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE International Conference on Document Analysis and Recognition\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.48550/arXiv.2305.03393\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE International Conference on Document Analysis and Recognition","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.48550/arXiv.2305.03393","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 1

摘要

从文档中提取表是任何文档转换管道中的关键任务。最近，基于变压器的模型已经证明，使用图像到标记序列(Im2Seq)方法可以以令人印象深刻的精度识别表结构。仅取表的图像，这样的模型预测一系列表示表结构的令牌(例如在HTML、LaTeX中)。由于表结构的令牌表示对任何Im2Seq模型的准确性和运行时性能都有重大影响，因此我们在本文中研究了如何优化表结构表示。我们提出了一种新的优化表结构语言(OTSL)，它具有最小化的词汇和特定的规则。OTSL的好处是它将令牌的数量减少到5个(HTML需要28个以上)，并将序列长度平均缩短到HTML的一半。因此，模型精度显著提高，与基于html的模型相比，推理时间缩短了一半，预测的表结构在语法上总是正确的。这反过来又消除了大多数后处理需求。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Optimized Table Tokenization for Table Structure Recognition

Extracting tables from documents is a crucial task in any document conversion pipeline. Recently, transformer-based models have demonstrated that table-structure can be recognized with impressive accuracy using Image-to-Markup-Sequence (Im2Seq) approaches. Taking only the image of a table, such models predict a sequence of tokens (e.g. in HTML, LaTeX) which represent the structure of the table. Since the token representation of the table structure has a significant impact on the accuracy and run-time performance of any Im2Seq model, we investigate in this paper how table-structure representation can be optimised. We propose a new, optimised table-structure language (OTSL) with a minimized vocabulary and specific rules. The benefits of OTSL are that it reduces the number of tokens to 5 (HTML needs 28+) and shortens the sequence length to half of HTML on average. Consequently, model accuracy improves significantly, inference time is halved compared to HTML-based models, and the predicted table structures are always syntactically correct. This in turn eliminates most post-processing needs.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

IEEE International Conference on Document Analysis and Recognition

自引率

0.00%

发文量