Optimizing complex extraction programs over evolving text data

Proceedings of the 2009 ACM SIGMOD International Conference on Management of data; Pub Date: 2009-06-29; DOI: 10.1145/1559845.1559881

Fei Chen, Byron J. Gao, A. Doan, Jun Yang, R. Ramakrishnan


Citations: 18

Abstract

Most information extraction (IE) approaches have considered only static text corpora, over which we apply IE only once. Many real-world text corpora, however, are dynamic. They evolve over time, and so to keep extracted information up to date we often must apply IE repeatedly, to consecutive corpus snapshots. Applying IE from scratch to each snapshot can take a lot of time. To avoid doing this, we have recently developed Cyclex, a system that recycles previous IE results to speed up IE over subsequent corpus snapshots. Cyclex clearly demonstrated the promise of the recycling idea. The work itself, however, is limited in that it considers only IE programs that contain a single IE "blackbox." In practice, many IE programs are far more complex, containing multiple IE blackboxes connected in a compositional "workflow." In this paper, we present Delex, a system that removes the above limitation. First, we identify many difficult challenges raised by Delex, including modeling complex IE programs for recycling purposes, implementing the recycling process efficiently, and searching for an optimal execution plan in a vast plan space with different recycling alternatives. Next, we describe our solutions to these challenges. Finally, we describe extensive experiments with both rule-based and learning-based IE programs over two real-world data sets, which demonstrate the utility of our approach.
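The recycling idea in the abstract can be illustrated with a small sketch. This is not the Delex or Cyclex algorithm, whose details the abstract does not give. It is a deliberately simplified stand-in that assumes exact-match reuse: each blackbox in a two-stage workflow caches its output keyed by a hash of its input text, so text that is unchanged between snapshots is never re-extracted. The class `RecyclingBlackbox`, the two toy extractors, and the sample snapshots are all hypothetical names and data introduced only for illustration.

```python
import hashlib
import re
from typing import Callable, Dict, List


class RecyclingBlackbox:
    """Wraps an IE blackbox and reuses its results for input text seen before.

    Assumption: reuse is by exact text match only (a simplification).
    """

    def __init__(self, name: str, extract: Callable[[str], List[str]]):
        self.name = name
        self.extract = extract
        self.cache: Dict[str, List[str]] = {}
        self.calls = 0  # how many times the underlying blackbox actually ran

    def __call__(self, text: str) -> List[str]:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self.cache:
            self.calls += 1
            self.cache[key] = self.extract(text)
        return list(self.cache[key])  # copy so callers cannot mutate the cache


def split_sentences(paragraph: str) -> List[str]:
    # Toy stage 1: split a paragraph into sentences.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip()]


def find_years(sentence: str) -> List[str]:
    # Toy stage 2: extract four-digit years from a sentence.
    return re.findall(r"\b(?:19|20)\d{2}\b", sentence)


sentences = RecyclingBlackbox("sentences", split_sentences)
years = RecyclingBlackbox("years", find_years)


def run_workflow(snapshot: List[str]) -> List[str]:
    """Apply the two-stage workflow to every paragraph of a corpus snapshot."""
    found: List[str] = []
    for paragraph in snapshot:
        for sentence in sentences(paragraph):
            found.extend(years(sentence))
    return found


# Hypothetical snapshots: only the second paragraph changes between them.
snapshot_1 = ["Page created in 2007. It lists two talks.",
              "A talk was added in 2009."]
snapshot_2 = ["Page created in 2007. It lists two talks.",
              "A talk was added in 2009. It was later moved."]

result_1 = run_workflow(snapshot_1)
calls_after_1 = (sentences.calls, years.calls)
result_2 = run_workflow(snapshot_2)
calls_after_2 = (sentences.calls, years.calls)
```

In this toy run the second snapshot triggers only one new call per blackbox (for the edited paragraph and its one new sentence), whereas running from scratch would repeat every call. Note that the unchanged first sentence of the edited paragraph is recycled at the second stage even though the first stage had to rerun, which hints at why a multi-blackbox workflow opens up several recycling alternatives.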
