SpannerLib: Embedding Declarative Information Extraction in an Imperative Workflow

arXiv - CS - Databases | Pub Date: 2024-09-03 | DOI: arxiv-2409.01736

Dean Light, Ahmad Aiashy, Mahmoud Diab, Daniel Nachmias, Stijn Vansummeren, Benny Kimelfeld


Abstract

Document spanners have been proposed as a formal framework for declarative Information Extraction (IE) from text, following IE products from industry and academia. Over the past decade, the framework has been studied thoroughly in terms of expressive power, complexity, and the ability to naturally combine text analysis with relational querying. This demonstration presents SpannerLib, a library for embedding document spanners in Python code. SpannerLib facilitates the development of IE programs by providing an implementation of Spannerlog (Datalog-based document spanners) that interacts with the Python code in two directions: rules can be embedded inside Python, and they can invoke custom Python code (e.g., calls to ML-based NLP models) via user-defined functions. The demonstration scenarios showcase IE programs, with increasing levels of complexity, within Jupyter Notebook.
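To make the notion of a document spanner concrete, the following is a minimal sketch in plain Python, using only the standard library. It is not SpannerLib's API and not Spannerlog syntax: `Span`, `regex_spanner`, and `is_org_domain` are hypothetical names introduced here for illustration. The sketch treats a spanner as a function from a document to a relation of spans, then filters that relation with an ordinary Python callback standing in for a user-defined function such as an NLP model call.

```python
import re
from typing import Callable, Iterator, NamedTuple, Tuple


class Span(NamedTuple):
    """A half-open character interval [start, end) of a document."""
    start: int
    end: int

    def text(self, doc: str) -> str:
        # Recover the substring that this span points at.
        return doc[self.start:self.end]


def regex_spanner(pattern: str) -> Callable[[str], Iterator[Tuple[Span, ...]]]:
    """Hypothetical helper: turn a regex into a spanner.

    Each match yields one tuple of spans, one span per capture group,
    so the result over a document is a relation of spans.
    """
    compiled = re.compile(pattern)

    def extract(doc: str) -> Iterator[Tuple[Span, ...]]:
        for match in compiled.finditer(doc):
            yield tuple(
                Span(*match.span(group))
                for group in range(1, compiled.groups + 1)
            )

    return extract


def is_org_domain(domain: str) -> bool:
    # Stand-in for a Python callback; a real program might call
    # an ML-based NLP model here instead of a string check.
    return domain.endswith(".org")


doc = "Contact alice@example.org or bob@example.com for details."

# Extraction step: a binary relation (user span, domain span).
email = regex_spanner(r"(\w+)@([\w.]+\w)")

# Relational step: select the tuples accepted by the callback
# and project the spans back to their text.
rows = [
    (user.text(doc), domain.text(doc))
    for user, domain in email(doc)
    if is_org_domain(domain.text(doc))
]
```

Keeping spans as offsets rather than strings is the point of the sketch: positions survive the extraction step, so later relational operations can still reason about where in the document each value came from.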


