Beyond myopic inference in big data pipelines

Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining | Pub Date: 2013-08-11 | DOI: 10.1145/2487575.2487588

Karthik Raman, Adith Swaminathan, J. Gehrke, T. Joachims

Citations: 12

Abstract

Big Data Pipelines decompose complex analyses of large data sets into a series of simpler tasks, with independently tuned components for each task. This modular setup allows re-use of components across several different pipelines. However, the interaction of independently tuned pipeline components yields poor end-to-end performance, as errors introduced by one component cascade through the whole pipeline, affecting overall accuracy. We propose a novel model for reasoning across components of Big Data Pipelines in a probabilistically well-founded manner. Our key idea is to view the interaction of components as dependencies on an underlying graphical model. Different message passing schemes on this graphical model provide various inference algorithms to trade off end-to-end performance and computational cost. We instantiate our framework with an efficient beam search algorithm, and demonstrate its efficiency on two Big Data Pipelines: parsing and relation extraction.
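To make the contrast between myopic inference and beam search concrete, here is a minimal Python sketch. It is not the authors' implementation: it assumes each pipeline component can be modeled as a function returning candidate outputs with conditional probabilities, and the names `beam_pipeline`, `tagger`, and `parser`, as well as all probabilities, are hypothetical and chosen only for illustration.

```python
import heapq
import math
from typing import Callable, List, Tuple

# Assumption: a component maps an input to (output, conditional probability) pairs.
Component = Callable[[object], List[Tuple[object, float]]]


def beam_pipeline(x: object, components: List[Component],
                  beam_width: int) -> List[Tuple[float, object]]:
    """Carry the beam_width most probable hypotheses through every stage.

    Returns (log-probability, final output) pairs, best first.
    beam_width=1 corresponds to myopic (greedy, top-1) inference.
    """
    beam = [(0.0, x)]  # (accumulated log-probability, intermediate output)
    for component in components:
        candidates = []
        for logp, value in beam:
            for out, p in component(value):
                if p > 0:  # skip impossible outputs to avoid log(0)
                    candidates.append((logp + math.log(p), out))
        # Keep only the best partial hypotheses for the next stage.
        beam = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
    return beam


# Hypothetical two-stage pipeline with made-up probabilities.
def tagger(_text):
    return [("A", 0.6), ("B", 0.4)]


def parser(tag):
    table = {
        "A": [("a1", 0.55), ("a2", 0.45)],
        "B": [("b1", 0.9), ("b2", 0.1)],
    }
    return table[tag]


greedy = beam_pipeline("input", [tagger, parser], beam_width=1)
wider = beam_pipeline("input", [tagger, parser], beam_width=2)
print(greedy[0][1], round(math.exp(greedy[0][0]), 2))
print(wider[0][1], round(math.exp(wider[0][0]), 2))
```

In this toy setup the greedy run commits to the first stage's top output `A` and ends at `a1` with joint probability 0.6 × 0.55 = 0.33, while a beam of width 2 also keeps `B` alive and finds `b1` with joint probability 0.4 × 0.9 = 0.36. A wider beam costs more component calls per stage, which is the performance versus computation trade-off the abstract describes.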

