Three approaches to "industrial" table spotting

Proceedings of Sixth International Conference on Document Analysis and Recognition. Pub Date: 2001-09-10. DOI: 10.1109/ICDAR.2001.953842

B. Klein, Serdar Gökkus, T. Kieninger, A. Dengel


Citations: 36

Abstract

This paper introduces three approaches that enable an industrial, comprehensive document analysis system to spot tables in documents. Searching for a set of known table headers (approach 1) works rather well on a significant number of documents. But this approach, although implemented to tolerate OCR errors, is not tolerant enough of some kinds of even minor aberrations. This not only degrades the recognition results but, even worse, makes users feel uncomfortable. Pragmatically trying to mimic the cues the human eye might key on leads to our two further, complementary approaches: searching for layout structures that resemble parts of columns (approach 2), and searching for groupings of similar lines (approach 3). To be suitable for our system, the approaches must be very simple to implement, simple to explain to users, computationally cheap, and combinable. In the domain of health insurance companies, which receive huge numbers of so-called medical liquidations on a daily basis, we obtain very good results. On document samples representative of the everyday practice of five customers (health insurance companies), tables were spotted as well and as quickly as the customers expected of the system. We thus consider our current approaches a step towards cognitive adequacy.
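The abstract gives no implementation details, so the following is only a minimal sketch of how approaches 1 and 3 might look on plain OCR text lines. It is not the authors' method. The header vocabulary `KNOWN_HEADERS`, the similarity threshold, the token classes, and all function names are assumptions made for illustration. Approach 2 (column-like layout structures) is not sketched, because it needs word coordinates that plain text lines do not carry.

```python
import re
from difflib import SequenceMatcher

# Hypothetical header vocabulary; the paper's real header set is not given here.
KNOWN_HEADERS = ["date", "code", "description", "amount"]


def fuzzy_equal(a, b, threshold=0.75):
    """Case-insensitive string similarity, used as a stand-in for OCR tolerance."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def find_header_line(lines, known_headers=KNOWN_HEADERS, min_hits=2):
    """Approach 1 (sketch): index of the first line that contains at least
    `min_hits` known header words under fuzzy matching, or None."""
    for i, line in enumerate(lines):
        tokens = line.split()
        hits = sum(any(fuzzy_equal(t, h) for t in tokens) for h in known_headers)
        if hits >= min_hits:
            return i
    return None


def line_shape(line):
    """Reduce a line to a coarse signature of token classes:
    D = date-like, N = numeric, A = anything else. Runs are collapsed,
    so 'Blood sample' and 'Consultation' both contribute a single 'A'."""
    shape = []
    for tok in line.split():
        if re.fullmatch(r"\d{1,2}\.\d{1,2}\.(\d{2}|\d{4})", tok):
            cls = "D"
        elif re.fullmatch(r"[\d.,]+", tok):
            cls = "N"
        else:
            cls = "A"
        if not shape or shape[-1] != cls:
            shape.append(cls)
    return "".join(shape)


def group_similar_lines(lines, min_run=3):
    """Approach 3 (sketch): (start, end) ranges, end exclusive, of at least
    `min_run` consecutive lines that share the same shape signature."""
    runs = []
    start = 0
    for i in range(1, len(lines) + 1):
        if i == len(lines) or line_shape(lines[i]) != line_shape(lines[start]):
            if i - start >= min_run and line_shape(lines[start]):
                runs.append((start, i))
            start = i
    return runs
```

Under these assumptions the two detectors are independent and cheap, so their results can be combined, for example by accepting a run of similar lines only when it starts shortly after a detected header line. That combination rule is likewise an illustration, not something the abstract specifies.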



