基于语义正则表达式合成的数据提取

IF 2.8 Q2 COMPUTER SCIENCE, SOFTWARE ENGINEERING

Proceedings of the ACM on Programming Languages Pub Date : 2023-10-16 DOI:10.1145/3622863

Qiaochu Chen, Arko Banerjee, Çağatay Demiralp, Greg Durrett, Işıl Dillig

{"title":"基于语义正则表达式合成的数据提取","authors":"Qiaochu Chen, Arko Banerjee, Çağatay Demiralp, Greg Durrett, Işıl Dillig","doi":"10.1145/3622863","DOIUrl":null,"url":null,"abstract":"Many data extraction tasks of practical relevance require not only syntactic pattern matching but also semantic reasoning about the content of the underlying text. While regular expressions are very well suited for tasks that require only syntactic pattern matching, they fall short for data extraction tasks that involve both a syntactic and semantic component. To address this issue, we introduce semantic regexes, a generalization of regular expressions that facilitates combined syntactic and semantic reasoning about textual data. We also propose a novel learning algorithm that can synthesize semantic regexes from a small number of positive and negative examples. Our proposed learning algorithm uses a combination of neural sketch generation and compositional type-directed synthesis for fast and effective generalization from a small number of examples. We have implemented these ideas in a new tool called Smore and evaluated it on representative data extraction tasks involving several textual datasets. Our evaluation shows that semantic regexes can better support complex data extraction tasks than standard regular expressions and that our learning algorithm significantly outperforms existing tools, including state-of-the-art neural networks and program synthesis tools.","PeriodicalId":20697,"journal":{"name":"Proceedings of the ACM on Programming Languages","volume":"77 1","pages":"0"},"PeriodicalIF":2.8000,"publicationDate":"2023-10-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":"{\"title\":\"Data Extraction via Semantic Regular Expression Synthesis\",\"authors\":\"Qiaochu Chen, Arko Banerjee, Çağatay Demiralp, Greg Durrett, Işıl Dillig\",\"doi\":\"10.1145/3622863\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Many data extraction tasks of practical relevance require not only syntactic pattern matching but also semantic reasoning about the content of the underlying text. While regular expressions are very well suited for tasks that require only syntactic pattern matching, they fall short for data extraction tasks that involve both a syntactic and semantic component. To address this issue, we introduce semantic regexes, a generalization of regular expressions that facilitates combined syntactic and semantic reasoning about textual data. We also propose a novel learning algorithm that can synthesize semantic regexes from a small number of positive and negative examples. Our proposed learning algorithm uses a combination of neural sketch generation and compositional type-directed synthesis for fast and effective generalization from a small number of examples. We have implemented these ideas in a new tool called Smore and evaluated it on representative data extraction tasks involving several textual datasets. Our evaluation shows that semantic regexes can better support complex data extraction tasks than standard regular expressions and that our learning algorithm significantly outperforms existing tools, including state-of-the-art neural networks and program synthesis tools.\",\"PeriodicalId\":20697,\"journal\":{\"name\":\"Proceedings of the ACM on Programming Languages\",\"volume\":\"77 1\",\"pages\":\"0\"},\"PeriodicalIF\":2.8000,\"publicationDate\":\"2023-10-16\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"2\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the ACM on Programming Languages\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3622863\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, SOFTWARE ENGINEERING\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the ACM on Programming Languages","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3622863","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, SOFTWARE ENGINEERING","Score":null,"Total":0}

引用次数: 2

摘要

许多实际相关的数据提取任务不仅需要语法模式匹配，还需要对底层文本的内容进行语义推理。虽然正则表达式非常适合只需要语法模式匹配的任务，但对于同时涉及语法和语义组件的数据提取任务来说，它们就显得有些不足了。为了解决这个问题，我们引入了语义正则表达式，这是正则表达式的一种泛化，有助于对文本数据进行语法和语义推理的结合。我们还提出了一种新的学习算法，可以从少量的正反例中合成语义正则。我们提出的学习算法使用神经草图生成和组合类型导向合成相结合的方法，从少量示例中快速有效地泛化。我们在一个名为Smore的新工具中实现了这些想法，并在涉及多个文本数据集的代表性数据提取任务上对其进行了评估。我们的评估表明，与标准正则表达式相比，语义正则表达式可以更好地支持复杂的数据提取任务，并且我们的学习算法明显优于现有的工具，包括最先进的神经网络和程序合成工具。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Data Extraction via Semantic Regular Expression Synthesis

Many data extraction tasks of practical relevance require not only syntactic pattern matching but also semantic reasoning about the content of the underlying text. While regular expressions are very well suited for tasks that require only syntactic pattern matching, they fall short for data extraction tasks that involve both a syntactic and semantic component. To address this issue, we introduce semantic regexes, a generalization of regular expressions that facilitates combined syntactic and semantic reasoning about textual data. We also propose a novel learning algorithm that can synthesize semantic regexes from a small number of positive and negative examples. Our proposed learning algorithm uses a combination of neural sketch generation and compositional type-directed synthesis for fast and effective generalization from a small number of examples. We have implemented these ideas in a new tool called Smore and evaluated it on representative data extraction tasks involving several textual datasets. Our evaluation shows that semantic regexes can better support complex data extraction tasks than standard regular expressions and that our learning algorithm significantly outperforms existing tools, including state-of-the-art neural networks and program synthesis tools.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings of the ACM on Programming Languages Engineering-Safety, Risk, Reliability and Quality

CiteScore

5.20

自引率

22.20%

发文量

192