Scalable lineage capture for debugging DISC analytics

Proceedings of the 4th annual Symposium on Cloud Computing. Pub Date: 2013-10-01. DOI: 10.1145/2523616.2523619

Dionysios Logothetis, Soumyarupa De, K. Yocum


Citations: 64

Abstract

A fundamental challenge for big-data analytics is how to efficiently tune and debug multi-step dataflows. This paper presents Newt, a scalable architecture for capturing and using record-level data lineage to discover and resolve errors in analytics. Newt's flexible instrumentation API allows system developers to collect this fine-grained lineage from a range of data-intensive scalable computing (DISC) architectures, actively recording the flow of data through multi-step, user-defined transformations. Newt pairs this API with a scale-out, fault-tolerant lineage store and query engine. We find that while active collection can be expensive, it incurs modest runtime overheads for real-world analytics (<36%) and enables novel lineage-based debugging techniques. For instance, Newt can efficiently recreate errors (crashes or bad outputs) or remove input data from the dataflow to enable data cleaning strategies. Additionally, Newt's active lineage collection allows retrospective analyses of a dataflow's behavior, such as identifying anomalous processing steps. As case studies, we instrument two DISC systems, Hadoop and Hyracks, with fewer than 105 lines of additional code for each. Finally, we use Newt to systematically clean input data to a Hadoop-based de novo genome assembler, improving the quality of the output assembly.
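To make the idea of record-level lineage concrete, the following is a minimal sketch in Python. It is not Newt's API, and nothing in it is taken from the paper's implementation: the `LineageStore` class, the `run` function, and the record identifiers are all hypothetical names chosen for illustration. The sketch records input-to-output associations for a toy two-step dataflow (a parse step followed by a per-key sum), traces a bad output back to the input records that contributed to it, and then reruns the dataflow with one suspect input removed, loosely mirroring the data cleaning strategy the abstract mentions.

```python
from collections import defaultdict


class LineageStore:
    """Toy in-memory lineage store (hypothetical; not Newt's API)."""

    def __init__(self):
        # Maps an output record id to the set of record ids it was derived from.
        self.parents = defaultdict(set)

    def add(self, inputs, output):
        """Record that `output` was produced from the records in `inputs`."""
        self.parents[output].update(inputs)

    def trace_back(self, record):
        """Return the source records (no recorded parents) behind `record`."""
        sources, stack, seen = set(), [record], set()
        while stack:
            r = stack.pop()
            if r in seen:
                continue
            seen.add(r)
            if r not in self.parents:
                sources.add(r)  # no parents recorded: an original input
            else:
                stack.extend(self.parents[r])
        return sources


def run(indexed_lines, store):
    """Two-step dataflow: parse "key,value" lines, then sum values per key.

    `indexed_lines` is a list of (input index, line) pairs so that record
    ids stay stable when some inputs are removed before a rerun.
    """
    groups = defaultdict(list)
    for i, line in indexed_lines:
        key, value = line.split(",")
        store.add([("in", i)], ("map", i))  # step 1 lineage: input -> parsed
        groups[key].append((("map", i), int(value)))
    out = {}
    for key, items in groups.items():
        out[key] = sum(v for _, v in items)
        # Step 2 lineage: every parsed record in the group -> the aggregate.
        store.add([rid for rid, _ in items], ("out", key))
    return out


lines = list(enumerate(["a,1", "b,2", "a,-999", "b,3"]))
store = LineageStore()
result = run(lines, store)

# The total for key "a" looks wrong; ask which inputs contributed to it.
suspects = store.trace_back(("out", "a"))

# Assume inspection shows input 2 is the bad record; drop it and rerun.
cleaned = [(i, line) for i, line in lines if i != 2]
cleaned_result = run(cleaned, LineageStore())
```

In this sketch, `suspects` contains only the two inputs that fed the output for key `a`, so the inputs for key `b` never need to be examined. A real system would have to capture such associations across distributed tasks and persist them at scale, which this toy does not attempt.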
