Online template matching over a stream of digitized documents

Proceedings of the 27th International Conference on Scientific and Statistical Database Management. Pub Date: 2015-06-29. DOI: 10.1145/2791347.2791354

M. Stockerl, Christoph Ringlstetter, Matthias Schubert, Eirini Ntoutsi, H. Kriegel


Citation count: 2

Abstract

Although we have lived in the information age for decades, paperwork is still a tedious part of everybody's life. Assistance systems that implement digitization and document understanding techniques may considerably reduce the time and effort users spend on it. A large portion of paper documents, such as invoices, delivery receipts, or admonitions, are based on a fixed, company-specific template and therefore exhibit a high degree of similarity. In this work, we propose a template extraction method over a stream of incoming documents and a template allocation method for assigning new instances from the stream to the most suitable templates. Our method represents the digital image of a paper document as text augmented by layout information. Document similarity is assessed with respect to both the textual and the layout parts of the document; matching terms contribute according to their distance to the query terms. To be more robust against distortions introduced by the digitization process, the templates are not static; rather, they are maintained in an online fashion based on their newly assigned documents. Experiments on real data show that combining textual and layout information with continuous template adaptation through online updates improves template identification quality over earlier proposed methods.
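The allocate-then-update loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the abstract gives no formulas, so the document representation (one normalized `(x, y)` position per term), the linear distance decay, the `radius` and `threshold` parameters, and the names `Template`, `similarity`, and `assign` are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Template:
    # Hypothetical structure: term -> (mean_x, mean_y, number of documents with the term)
    terms: dict = field(default_factory=dict)
    n_docs: int = 0

    def update(self, doc):
        """Fold a newly assigned document into the template via running means."""
        self.n_docs += 1
        for term, (x, y) in doc.items():
            mx, my, c = self.terms.get(term, (x, y, 0))
            c += 1
            self.terms[term] = (mx + (x - mx) / c, my + (y - my) / c, c)


def similarity(doc, template, radius=0.1):
    """Score in [0, 1]: a shared term counts more the closer it sits to the
    template's position for that term (assumed linear decay within `radius`)."""
    if not doc or not template.terms:
        return 0.0
    score = 0.0
    for term, (x, y) in doc.items():
        if term in template.terms:
            mx, my, _ = template.terms[term]
            distance = math.hypot(x - mx, y - my)
            score += max(0.0, 1.0 - distance / radius)
    return score / max(len(doc), len(template.terms))


def assign(doc, templates, threshold=0.5):
    """Assign `doc` to its best template, or open a new one, then update it.
    Returns the index of the chosen template in `templates`."""
    scores = [similarity(doc, t) for t in templates]
    best = max(range(len(templates)), key=scores.__getitem__, default=None)
    if best is None or scores[best] < threshold:
        templates.append(Template())
        best = len(templates) - 1
    templates[best].update(doc)  # online maintenance step
    return best


# Toy stream: two scans of the same invoice layout, then an unrelated document.
templates = []
inv1 = {"invoice": (0.10, 0.10), "total": (0.80, 0.90), "acme": (0.50, 0.05)}
inv2 = {"invoice": (0.11, 0.10), "total": (0.80, 0.91), "acme": (0.50, 0.05)}
other = {"delivery": (0.10, 0.10), "receipt": (0.30, 0.10)}
labels = [assign(d, templates) for d in (inv1, inv2, other)]
```

In this toy stream the two invoice scans land in the same template despite small positional shifts, and the template's stored positions drift toward the mean of its members, which is one simple way to read "maintained in an online fashion".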


