Constant-Delay Enumeration for Nondeterministic Document Spanners

IF 2.2 · Zone 2, Computer Science · Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS

ACM Transactions on Database Systems Pub Date : 2021-04-14 DOI:10.1145/3436487

Antoine Amarilli, Pierre Bourhis, Stefan Mengel, Matthias Niewerth

Citations: 0

Abstract

We consider the information extraction framework known as document spanners and study the problem of efficiently computing the results of the extraction from an input document, where the extraction task is described as a sequential variable-set automaton (VA). We pose this problem in the setting of enumeration algorithms, where we can first run a preprocessing phase and must then produce the results with a small delay between any two consecutive results. Our goal is to have an algorithm that is tractable in combined complexity, i.e., in the sizes of the input document and the VA, while ensuring the best possible data complexity bounds in the input document size, i.e., constant delay in the document size. Several recent works at PODS’18 proposed such algorithms, but with linear delay in the document size or with an exponential dependency on the size of the (generally nondeterministic) input VA. In particular, Florenzano et al. suggest that our desired runtime guarantees cannot be met for general sequential VAs. We refute this and show that, given a nondeterministic sequential VA and an input document, we can enumerate the mappings of the VA on the document with the following bounds: the preprocessing is linear in the document size and polynomial in the size of the VA, and the delay is independent of the document and polynomial in the size of the VA. The resulting algorithm thus achieves tractability in combined complexity and the best possible data complexity bounds. Moreover, it is rather easy to describe, particularly for the restricted case of so-called extended VAs. Finally, we evaluate our algorithm empirically using a prototype implementation.
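
To make the problem statement concrete, here is a minimal Python sketch of what "enumerating the mappings of a VA on a document" means. It is not the algorithm of the paper: it is a naive exhaustive search with no preprocessing phase and no delay guarantee, and it removes duplicate mappings with a set, which is exactly the cost that nondeterminism makes hard to avoid. The transition encoding, the function name `enumerate_mappings`, and the example automaton are all assumptions made for illustration.

```python
from typing import Dict, Iterator, List, Set, Tuple

Span = Tuple[int, int]  # half-open span [start, end) over document positions

# Assumed encoding: a transition is (source, label, target), where a label is
# either a single character or a marker ("open", var) / ("close", var).


def enumerate_mappings(
    transitions: List[tuple],
    initial: int,
    accepting: Set[int],
    document: str,
) -> Iterator[Dict[str, Span]]:
    """Naively yield each distinct mapping produced by an accepting run."""
    by_source: Dict[int, list] = {}
    for src, label, dst in transitions:
        by_source.setdefault(src, []).append((label, dst))

    seen_mappings = set()  # deduplicates mappings reached by several runs
    visited = set()        # deduplicates search configurations
    # A configuration is (state, position, open variables, closed spans).
    stack = [(initial, 0, (), ())]
    while stack:
        config = stack.pop()
        if config in visited:
            continue
        visited.add(config)
        state, pos, opened, closed = config

        # Accept at the end of the document if no variable is left open.
        if pos == len(document) and state in accepting and not opened:
            if closed not in seen_mappings:
                seen_mappings.add(closed)
                yield dict(closed)

        for label, dst in by_source.get(state, []):
            if isinstance(label, str):
                # Letter transition: consume one character of the document.
                if pos < len(document) and document[pos] == label:
                    stack.append((dst, pos + 1, opened, closed))
                continue
            kind, var = label
            open_at = dict(opened)
            if kind == "open" and var not in open_at and var not in dict(closed):
                new_open = tuple(sorted(opened + ((var, pos),)))
                stack.append((dst, pos, new_open, closed))
            elif kind == "close" and var in open_at:
                new_open = tuple(item for item in opened if item[0] != var)
                new_closed = tuple(sorted(closed + ((var, (open_at[var], pos)),)))
                stack.append((dst, pos, new_open, new_closed))


# Hypothetical VA: capture any single "a" in variable x. States 2 and 4 are
# redundant copies, so every mapping is produced by two different runs.
example_transitions = [
    (0, "a", 0), (0, "b", 0), (0, ("open", "x"), 1),
    (1, "a", 2), (1, "a", 4),
    (2, ("close", "x"), 3), (4, ("close", "x"), 3),
    (3, "a", 3), (3, "b", 3),
]
results = sorted(
    enumerate_mappings(example_transitions, 0, {3}, "aba"),
    key=lambda mapping: sorted(mapping.items()),
)
```

In this sketch the time between two consecutive outputs can grow with the document, because the search may explore many dead-end configurations before reaching the next accepting one. The bounds stated in the abstract exclude precisely this behavior: after linear-time preprocessing, the delay must not depend on the document at all.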

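Source journal

ACM Transactions on Database Systems (Engineering & Technology, Computer Science: Software Engineering)

CiteScore

5.60

Self-citation rate

0.00%

Number of publications

Review time

>12 weeks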

Journal description: Heavily used in both academic and corporate R&D settings, ACM Transactions on Database Systems (TODS) is a key publication for computer scientists working in data abstraction, data modeling, and designing data management systems. Topics include storage and retrieval, transaction management, distributed and federated databases, semantics of data, intelligent databases, and operations and algorithms relating to these areas. In this rapidly changing field, TODS provides insights into the thoughts of the best minds in database R&D.