Integrating Pig with Harp to Support Iterative Applications with Fast Cache and Customized Communication

T. Wu, A. Koppula, J. Qiu
{"title":"整合猪与竖琴,支持迭代应用与快速缓存和定制通信","authors":"T. Wu, A. Koppula, J. Qiu","doi":"10.1109/DataCloud.2014.8","DOIUrl":null,"url":null,"abstract":"Use of high-level scripting languages to solve big data problems has become a mainstream approach for sophisticated machine learning data analysis. Often data must be used in several steps of a computation to complete a full task. Composing default data transformation operators with the standard Hadoop MapReduce runtime is very convenient. However, the current strategy of using high-level languages to support iterative applications with Hadoop MapReduce relies on an external wrapper script in other languages such as Python and Groovy, which causes significant performance loss when restarting mappers and reducers between jobs. In this paper, we reduce the extra job startup overheads by integrating Apache Pig with the high-performance Hadoop plug-in Harp developed at Indiana University. This provides fast data caching and customized communication patterns among iterations for data analysis. The results show performance improvements of factors from 2 to 5.","PeriodicalId":121831,"journal":{"name":"2014 5th International Workshop on Data-Intensive Computing in the Clouds","volume":"49 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2014-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":"{\"title\":\"Integrating Pig with Harp to Support Iterative Applications with Fast Cache and Customized Communication\",\"authors\":\"T. Wu, A. Koppula, J. Qiu\",\"doi\":\"10.1109/DataCloud.2014.8\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Use of high-level scripting languages to solve big data problems has become a mainstream approach for sophisticated machine learning data analysis. Often data must be used in several steps of a computation to complete a full task. Composing default data transformation operators with the standard Hadoop MapReduce runtime is very convenient. However, the current strategy of using high-level languages to support iterative applications with Hadoop MapReduce relies on an external wrapper script in other languages such as Python and Groovy, which causes significant performance loss when restarting mappers and reducers between jobs. In this paper, we reduce the extra job startup overheads by integrating Apache Pig with the high-performance Hadoop plug-in Harp developed at Indiana University. This provides fast data caching and customized communication patterns among iterations for data analysis. 
The results show performance improvements of factors from 2 to 5.\",\"PeriodicalId\":121831,\"journal\":{\"name\":\"2014 5th International Workshop on Data-Intensive Computing in the Clouds\",\"volume\":\"49 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2014-11-16\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"5\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2014 5th International Workshop on Data-Intensive Computing in the Clouds\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/DataCloud.2014.8\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2014 5th International Workshop on Data-Intensive Computing in the Clouds","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/DataCloud.2014.8","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 5

Abstract

Use of high-level scripting languages to solve big data problems has become a mainstream approach for sophisticated machine learning data analysis. Often data must be used in several steps of a computation to complete a full task. Composing default data transformation operators with the standard Hadoop MapReduce runtime is very convenient. However, the current strategy of using high-level languages to support iterative applications with Hadoop MapReduce relies on an external wrapper script in another language such as Python or Groovy, which causes significant performance loss when mappers and reducers are restarted between jobs. In this paper, we reduce the extra job startup overheads by integrating Apache Pig with Harp, the high-performance Hadoop plug-in developed at Indiana University. This provides fast data caching and customized communication patterns across iterations for data analysis. The results show performance improvements by factors of 2 to 5.
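
To make the startup overhead concrete, the driver pattern the paper argues against looks roughly like the following minimal sketch (not taken from the paper; the script name, parameter, and iteration count are hypothetical). Every iteration submits a fresh Pig job, so MapReduce task startup and HDFS re-reads are paid again on each pass:

# driver.py - hypothetical external wrapper that resubmits a Pig script per iteration
import subprocess

NUM_ITERATIONS = 10  # assumed iteration count, for illustration only

for i in range(NUM_ITERATIONS):
    # Each call launches a new Pig/MapReduce job, so mappers and reducers are
    # restarted and intermediate data is re-read from HDFS on every pass; this
    # per-job startup cost is what the Pig+Harp integration aims to eliminate
    # via in-memory caching and collective communication between iterations.
    subprocess.run(
        ["pig", "-param", f"iteration={i}", "iterative_step.pig"],
        check=True,
    )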