Landmarks and regions: a robust approach to data extraction

Proceedings of the 43rd ACM SIGPLAN International Conference on Programming Language Design and Implementation. Pub Date: 2022-04-11. DOI: 10.1145/3519939.3523705

Suresh Parthasarathy, Lincy Pattanaik, Anirudh Khatry, Arun Shankar Iyer, Arjun Radhakrishna, S. Rajamani, Mohammad Raza


Citations: 0

Abstract

We propose a new approach to extracting data items or field values from semi-structured documents. Examples of such problems include extracting the passenger name, departure time, and departure airport from a travel itinerary, or extracting the price of an item from a purchase receipt. Traditional approaches to data extraction use machine learning or program synthesis to process the whole document to extract the desired fields. Such approaches are not robust to format changes in the document, and the extraction process typically fails even if changes are made to parts of the document that are unrelated to the fields of interest. We propose a new approach to data extraction based on the concepts of landmarks and regions. Humans routinely use landmarks in manual processing of documents to zoom in and focus their attention on small regions of interest in the document. Inspired by this human intuition, we use the notion of landmarks in program synthesis to automatically synthesize extraction programs that first extract a small region of interest, and then extract the desired value from that region in a subsequent step. We have implemented our landmark-based extraction approach in a tool, LRSyn, and present an extensive evaluation on documents in HTML as well as scanned images of invoices and receipts. Our results show that our approach is robust to various types of format changes that routinely happen in real-world settings.
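The two-step idea in the abstract (find a landmark, narrow to a small region around it, then extract the value from that region only) can be illustrated with a minimal Python sketch. This is not LRSyn and does not reproduce its synthesis procedure: the landmark string, the region size, and the value pattern are written by hand here, whereas the paper synthesizes extraction programs automatically. All names (`extract_region`, `extract_value`, `extract_field`, the sample itineraries) are hypothetical and chosen only for illustration.

```python
import re
from typing import List, Optional


def extract_region(lines: List[str], landmark: str, span: int = 1) -> Optional[List[str]]:
    """Step 1: return the first line containing the landmark plus `span` following lines."""
    for i, line in enumerate(lines):
        if landmark in line:
            return lines[i : i + 1 + span]
    return None  # landmark not present in this document


def extract_value(region: List[str], pattern: str) -> Optional[str]:
    """Step 2: return the first capture of `pattern` found inside the region."""
    for line in region:
        match = re.search(pattern, line)
        if match:
            return match.group(1)
    return None


def extract_field(document: str, landmark: str, pattern: str, span: int = 1) -> Optional[str]:
    """Compose both steps: locate the region, then extract the value from it."""
    region = extract_region(document.splitlines(), landmark, span)
    if region is None:
        return None
    return extract_value(region, pattern)


# Hypothetical value pattern and sample documents (assumptions, not from the paper).
TIME_PATTERN = r"(\d{1,2}:\d{2})"

ITINERARY_V1 = "Passenger: A. Traveler\nDeparture: 09:45\nAirport: SEA\n"

# Same data after a format change: an unrelated promo line with a time was added
# at the top, and the departure time moved to the line below its label.
ITINERARY_V2 = (
    "Promo: book by 23:59 tonight!\n"
    "Passenger: A. Traveler\n"
    "Gate info below\n"
    "Departure:\n"
    "  09:45 (local)\n"
    "Airport: SEA\n"
)

if __name__ == "__main__":
    # Landmark-based extraction on both layouts.
    print(extract_field(ITINERARY_V1, "Departure", TIME_PATTERN))
    print(extract_field(ITINERARY_V2, "Departure", TIME_PATTERN))
    # A whole-document search is distracted by the unrelated promo line.
    print(re.search(TIME_PATTERN, ITINERARY_V2).group(1))
```

In this sketch the landmark-based extractor returns the departure time for both layouts, while the whole-document regular expression picks up the time from the unrelated promo line in the second layout. That is the kind of failure under unrelated format changes the abstract attributes to whole-document approaches.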
