WEATHERGOV+: A Table Recognition and Summarization Dataset to Bridge the Gap Between Document Image Analysis and Natural Language Generation

Proceedings of the ACM Symposium on Document Engineering 2023. Pub Date: 2023-08-22. DOI: 10.1145/3573128.3604901

Amanda Dash, Melissa Cote, A. Albu

Abstract

Tables, ubiquitous in data-oriented documents like scientific papers and financial statements, organize and convey relational information. Automatic table recognition from document images, which involves detection within the page, structural segmentation into rows, columns, and cells, and information extraction from cells, has been a popular research topic in document image analysis (DIA). With recent advances in natural language generation (NLG) based on deep neural networks, data-to-text generation, in particular for table summarization, offers interesting solutions to time-intensive data analysis. In this paper, we aim to bridge the gap between efforts in DIA and NLG regarding tabular data: we propose WEATHERGOV+, a dataset building upon the WEATHERGOV dataset, the standard for tabular data summarization techniques, that allows for the training and testing of end-to-end methods that take document images as input and generate text summaries as output. WEATHERGOV+ contains images of tables created from the tabular data of WEATHERGOV using visual variations that cover various levels of difficulty, along with the corresponding human-generated table summaries of WEATHERGOV. We also propose an end-to-end pipeline that compares state-of-the-art table recognition methods for summarization purposes. We analyse the results of the proposed pipeline by evaluating WEATHERGOV+ at each stage of the pipeline to identify the effects of error propagation and the weaknesses of the current methods, such as OCR errors. With this research (dataset and code are available via footnote 1 of the paper), we hope to encourage new research for the processing and management of inter- and intra-document collections.
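The stage-wise evaluation mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline or metric: the function names, the toy table, and the choice of exact-match and character-similarity scores are assumptions made for illustration only. The idea is to compare the cells recovered by a table recognizer against the ground-truth cells before any summarizer sees them, so that OCR errors can be measured separately from summarization errors.

```python
from difflib import SequenceMatcher


def cell_accuracy(truth, recognized):
    """Fraction of ground-truth cells reproduced exactly at the same (row, column)."""
    total = sum(len(row) for row in truth)
    if total == 0:
        return 1.0
    correct = 0
    for r, row in enumerate(truth):
        for c, cell in enumerate(row):
            in_bounds = r < len(recognized) and c < len(recognized[r])
            if in_bounds and recognized[r][c] == cell:
                correct += 1
    return correct / total


def mean_cell_similarity(truth, recognized):
    """Average character-level similarity per cell; a missing cell scores 0."""
    scores = []
    for r, row in enumerate(truth):
        for c, cell in enumerate(row):
            in_bounds = r < len(recognized) and c < len(recognized[r])
            other = recognized[r][c] if in_bounds else None
            if other is None:
                scores.append(0.0)
            else:
                scores.append(SequenceMatcher(None, cell, other).ratio())
    return sum(scores) / len(scores) if scores else 1.0


# Hypothetical table content, not taken from the dataset.
truth = [["temperature", "min", "52"], ["windSpeed", "max", "10"]]
# Simulated recognizer output with one OCR confusion ("0" read as "O").
recognized = [["temperature", "min", "52"], ["windSpeed", "max", "1O"]]

print(cell_accuracy(truth, recognized))
print(mean_cell_similarity(truth, recognized))
```

Scoring the recognizer output on its own, and then scoring the summaries generated from both the ground-truth and the recognized tables, is one way to see how much of the final error comes from upstream recognition.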
