{"title":"WEATHERGOV+: A Table Recognition and Summarization Dataset to Bridge the Gap Between Document Image Analysis and Natural Language Generation","authors":"Amanda Dash, Melissa Cote, A. Albu","doi":"10.1145/3573128.3604901","DOIUrl":null,"url":null,"abstract":"Tables, ubiquitous in data-oriented documents like scientific papers and financial statements, organize and convey relational information. Automatic table recognition from document images, which involves detection within the page, structural segmentation into rows, columns, and cells, and information extraction from cells, has been a popular research topic in document image analysis (DIA). With recent advances in natural language generation (NLG) based on deep neural networks, data-to-text generation, in particular for table summarization, offers interesting solutions to time-intensive data analysis. In this paper, we aim to bridge the gap between efforts in DIA and NLG regarding tabular data: we propose WEATHERGOV+, a dataset building upon the WEATHERGOV dataset, the standard for tabular data summarization techniques, that allows for the training and testing of end-to-end methods working from input document images to generate text summaries as output. WEATHERGOV+ contains images of tables created from the tabular data of WEATHERGOV using visual variations that cover various levels of difficulty, along with the corresponding human-generated table summaries of WEATHERGOV. We also propose an end-to-end pipeline that compares state-of-the-art table recognition methods for summarization purposes. We analyse the results of the proposed pipeline by evaluating WEATHERGOV+ at each stage of the pipeline to identify the effects of error propagation and the weaknesses of the current methods, such as OCR errors. With this research (dataset and code available here1), we hope to encourage new research for the processing and management of inter- and intra-document collections.","PeriodicalId":310776,"journal":{"name":"Proceedings of the ACM Symposium on Document Engineering 2023","volume":"35 2 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the ACM Symposium on Document Engineering 2023","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3573128.3604901","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
Abstract
Tables, ubiquitous in data-oriented documents like scientific papers and financial statements, organize and convey relational information. Automatic table recognition from document images, which involves detection within the page, structural segmentation into rows, columns, and cells, and information extraction from cells, has been a popular research topic in document image analysis (DIA). With recent advances in natural language generation (NLG) based on deep neural networks, data-to-text generation, in particular for table summarization, offers interesting solutions to time-intensive data analysis. In this paper, we aim to bridge the gap between efforts in DIA and NLG regarding tabular data: we propose WEATHERGOV+, a dataset building upon the WEATHERGOV dataset, the standard for tabular data summarization techniques, that allows for the training and testing of end-to-end methods working from input document images to generate text summaries as output. WEATHERGOV+ contains images of tables created from the tabular data of WEATHERGOV using visual variations that cover various levels of difficulty, along with the corresponding human-generated table summaries of WEATHERGOV. We also propose an end-to-end pipeline that compares state-of-the-art table recognition methods for summarization purposes. We analyse the results of the proposed pipeline by evaluating WEATHERGOV+ at each stage of the pipeline to identify the effects of error propagation and the weaknesses of the current methods, such as OCR errors. With this research (dataset and code available here1), we hope to encourage new research for the processing and management of inter- and intra-document collections.