Automatic building of a large and complete dataset for image-based table structure recognition

Tran Quang Vinh, Nguyen Thi Ngoc Diep
VNU Journal of Science: Computer Science and Communication Engineering
DOI: 10.25073/2588-1086/vnucsce.293
Citations: 0

Abstract

Tables are one of the most common ways to represent structured data in documents. Existing research on image-based table structure recognition often relies on limited datasets; the largest, the ICDAR 19 Track B dataset, contains 3,789 human-labeled tables. The recent TableBank dataset for table structures contains 145K tables; however, the tables are labeled in an HTML tag sequence format, which impedes the development of image-based recognition methods. In this paper, we propose several processing methods that automatically convert an HTML tag sequence annotation into bounding box annotations for the table cells in a table image. By ensembling these methods, we could convert 42,028 tables with high correctness, a dataset 11 times larger than the largest existing one (ICDAR 19). We then demonstrate that with these bounding box annotations, a straightforward representation of objects in images, off-the-shelf deep learning models achieve much higher F1-scores of table structure recognition at many high IoU thresholds: an F1-score of 0.66 compared to the state-of-the-art of 0.44 on the ICDAR 19 dataset. A further experiment shows that explicit bounding box annotation for image-based table structure recognition yields higher accuracy (70.6%) than implicit text sequence annotation (only 33.8%). The experiments show the effectiveness of our largest-to-date dataset in opening up opportunities to generalize to real-world applications. Our dataset and experimental models are publicly available at shorturl.at/hwHY3
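The F1-scores at IoU thresholds reported in the abstract compare predicted cell bounding boxes against ground-truth boxes. The following is a minimal sketch of how such a metric is commonly computed; the function names, box format `(x1, y1, x2, y2)`, and greedy one-to-one matching strategy are illustrative assumptions, not the paper's exact evaluation protocol.

```python
# Sketch of cell-detection F1 at an IoU threshold (illustrative, not the
# paper's exact protocol). Boxes are (x1, y1, x2, y2) tuples in pixels.

def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def f1_at_iou(pred_boxes, gt_boxes, thresh=0.9):
    """F1 with greedy one-to-one matching of predicted cells to ground truth.

    A prediction counts as a true positive if it matches an unclaimed
    ground-truth cell with IoU >= thresh.
    """
    matched_gt = set()
    tp = 0
    for p in pred_boxes:
        best_j, best = None, thresh
        for j, g in enumerate(gt_boxes):
            if j in matched_gt:
                continue
            score = iou(p, g)
            if score >= best:
                best, best_j = score, j
        if best_j is not None:
            matched_gt.add(best_j)
            tp += 1
    precision = tp / len(pred_boxes) if pred_boxes else 0.0
    recall = tp / len(gt_boxes) if gt_boxes else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

At a high threshold such as 0.9, only tightly localized cell boxes count as correct, which is why scores at high IoU thresholds are the discriminating comparison in the abstract.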