SpannerLib: Embedding Declarative Information Extraction in an Imperative Workflow
Dean Light, Ahmad Aiashy, Mahmoud Diab, Daniel Nachmias, Stijn Vansummeren, Benny Kimelfeld
arXiv - CS - Databases, published 2024-09-03
DOI: arxiv-2409.01736 (https://doi.org/arxiv-2409.01736)
Citations: 0
Abstract
Document spanners have been proposed as a formal framework for declarative
Information Extraction (IE) from text, following IE products from industry
and academia. Over the past decade, the framework has been studied thoroughly
in terms of expressive power, complexity, and the ability to naturally combine
text analysis with relational querying. This demonstration presents SpannerLib,
a library for embedding document spanners in Python code. SpannerLib
facilitates the development of IE programs by providing an implementation of
Spannerlog (Datalog-based document spanners) that interacts with the Python code
in two directions: rules can be embedded inside Python, and they can invoke
custom Python code (e.g., calls to ML-based NLP models) via user-defined
functions. The demonstration scenarios showcase IE programs, with increasing
levels of complexity, within Jupyter Notebook.
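
To make the two ideas in the abstract concrete, the following is a minimal plain-Python sketch of a document spanner (a function that maps a document to a relation of spans) combined with a rule-like filter that calls back into ordinary Python. It does not use SpannerLib or Spannerlog syntax; `Span`, `regex_spanner`, `is_org_address`, and the sample document are hypothetical names chosen for illustration, and the `.org` check merely stands in for a call to an ML-based NLP model.

```python
import re
from typing import Callable, Iterator, List, NamedTuple, Tuple


class Span(NamedTuple):
    """A half-open character interval [start, end) of a document."""
    start: int
    end: int


def regex_spanner(
    pattern: str, var_names: List[str]
) -> Callable[[str], Iterator[Tuple[Span, ...]]]:
    """Build a spanner from a regex with named groups (hypothetical helper).

    The returned function maps a document to tuples of spans,
    one span per variable in var_names, one tuple per match.
    """
    compiled = re.compile(pattern)

    def extract(doc: str) -> Iterator[Tuple[Span, ...]]:
        for match in compiled.finditer(doc):
            yield tuple(Span(*match.span(name)) for name in var_names)

    return extract


def is_org_address(doc: str, span: Span) -> bool:
    """Stand-in for a user-defined function, e.g. a call to an NLP model."""
    return doc[span.start:span.end].endswith(".org")


doc = "Contact dana@example.org or lee@example.com for details."

# Spanner with two variables: the user part and the domain part of an address.
email = regex_spanner(r"(?P<user>\w+)@(?P<domain>[\w.]+\w)", ["user", "domain"])

# Rule-like query in the spirit of a Datalog rule:
#   OrgUser(u) <- email(doc) yields (u, d), is_org_address(d)
org_users = [
    doc[u.start:u.end]
    for u, d in email(doc)
    if is_org_address(doc, d)
]
```

The spanner returns positions rather than strings, so later steps can relate extractions by where they occur in the text; the list comprehension plays the role of the relational query, and the predicate plays the role of a user-defined function.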