Data Extraction via Semantic Regular Expression Synthesis

IF 2.2 Q2 COMPUTER SCIENCE, SOFTWARE ENGINEERING
Qiaochu Chen, Arko Banerjee, Çağatay Demiralp, Greg Durrett, Işıl Dillig
{"title":"Data Extraction via Semantic Regular Expression Synthesis","authors":"Qiaochu Chen, Arko Banerjee, Çağatay Demiralp, Greg Durrett, Işıl Dillig","doi":"10.1145/3622863","DOIUrl":null,"url":null,"abstract":"Many data extraction tasks of practical relevance require not only syntactic pattern matching but also semantic reasoning about the content of the underlying text. While regular expressions are very well suited for tasks that require only syntactic pattern matching, they fall short for data extraction tasks that involve both a syntactic and semantic component. To address this issue, we introduce semantic regexes, a generalization of regular expressions that facilitates combined syntactic and semantic reasoning about textual data. We also propose a novel learning algorithm that can synthesize semantic regexes from a small number of positive and negative examples. Our proposed learning algorithm uses a combination of neural sketch generation and compositional type-directed synthesis for fast and effective generalization from a small number of examples. We have implemented these ideas in a new tool called Smore and evaluated it on representative data extraction tasks involving several textual datasets. Our evaluation shows that semantic regexes can better support complex data extraction tasks than standard regular expressions and that our learning algorithm significantly outperforms existing tools, including state-of-the-art neural networks and program synthesis tools.","PeriodicalId":20697,"journal":{"name":"Proceedings of the ACM on Programming Languages","volume":"77 1","pages":"0"},"PeriodicalIF":2.2000,"publicationDate":"2023-10-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the ACM on Programming Languages","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3622863","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, SOFTWARE ENGINEERING","Score":null,"Total":0}
引用次数: 2

Abstract

Many data extraction tasks of practical relevance require not only syntactic pattern matching but also semantic reasoning about the content of the underlying text. While regular expressions are very well suited for tasks that require only syntactic pattern matching, they fall short for data extraction tasks that involve both a syntactic and semantic component. To address this issue, we introduce semantic regexes, a generalization of regular expressions that facilitates combined syntactic and semantic reasoning about textual data. We also propose a novel learning algorithm that can synthesize semantic regexes from a small number of positive and negative examples. Our proposed learning algorithm uses a combination of neural sketch generation and compositional type-directed synthesis for fast and effective generalization from a small number of examples. We have implemented these ideas in a new tool called Smore and evaluated it on representative data extraction tasks involving several textual datasets. Our evaluation shows that semantic regexes can better support complex data extraction tasks than standard regular expressions and that our learning algorithm significantly outperforms existing tools, including state-of-the-art neural networks and program synthesis tools.
基于语义正则表达式合成的数据提取
许多实际相关的数据提取任务不仅需要语法模式匹配,还需要对底层文本的内容进行语义推理。虽然正则表达式非常适合只需要语法模式匹配的任务,但对于同时涉及语法和语义组件的数据提取任务来说,它们就显得有些不足了。为了解决这个问题,我们引入了语义正则表达式,这是正则表达式的一种泛化,有助于对文本数据进行语法和语义推理的结合。我们还提出了一种新的学习算法,可以从少量的正反例中合成语义正则。我们提出的学习算法使用神经草图生成和组合类型导向合成相结合的方法,从少量示例中快速有效地泛化。我们在一个名为Smore的新工具中实现了这些想法,并在涉及多个文本数据集的代表性数据提取任务上对其进行了评估。我们的评估表明,与标准正则表达式相比,语义正则表达式可以更好地支持复杂的数据提取任务,并且我们的学习算法明显优于现有的工具,包括最先进的神经网络和程序合成工具。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
Proceedings of the ACM on Programming Languages
Proceedings of the ACM on Programming Languages Engineering-Safety, Risk, Reliability and Quality
CiteScore
5.20
自引率
22.20%
发文量
192
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信