Integrating Pig with Harp to Support Iterative Applications with Fast Cache and Customized Communication

T. Wu, A. Koppula, J. Qiu
{"title":"整合猪与竖琴,支持迭代应用与快速缓存和定制通信","authors":"T. Wu, A. Koppula, J. Qiu","doi":"10.1109/DataCloud.2014.8","DOIUrl":null,"url":null,"abstract":"Use of high-level scripting languages to solve big data problems has become a mainstream approach for sophisticated machine learning data analysis. Often data must be used in several steps of a computation to complete a full task. Composing default data transformation operators with the standard Hadoop MapReduce runtime is very convenient. However, the current strategy of using high-level languages to support iterative applications with Hadoop MapReduce relies on an external wrapper script in other languages such as Python and Groovy, which causes significant performance loss when restarting mappers and reducers between jobs. In this paper, we reduce the extra job startup overheads by integrating Apache Pig with the high-performance Hadoop plug-in Harp developed at Indiana University. This provides fast data caching and customized communication patterns among iterations for data analysis. The results show performance improvements of factors from 2 to 5.","PeriodicalId":121831,"journal":{"name":"2014 5th International Workshop on Data-Intensive Computing in the Clouds","volume":"49 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2014-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":"{\"title\":\"Integrating Pig with Harp to Support Iterative Applications with Fast Cache and Customized Communication\",\"authors\":\"T. Wu, A. Koppula, J. Qiu\",\"doi\":\"10.1109/DataCloud.2014.8\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Use of high-level scripting languages to solve big data problems has become a mainstream approach for sophisticated machine learning data analysis. Often data must be used in several steps of a computation to complete a full task. Composing default data transformation operators with the standard Hadoop MapReduce runtime is very convenient. However, the current strategy of using high-level languages to support iterative applications with Hadoop MapReduce relies on an external wrapper script in other languages such as Python and Groovy, which causes significant performance loss when restarting mappers and reducers between jobs. In this paper, we reduce the extra job startup overheads by integrating Apache Pig with the high-performance Hadoop plug-in Harp developed at Indiana University. This provides fast data caching and customized communication patterns among iterations for data analysis. 
The results show performance improvements of factors from 2 to 5.\",\"PeriodicalId\":121831,\"journal\":{\"name\":\"2014 5th International Workshop on Data-Intensive Computing in the Clouds\",\"volume\":\"49 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2014-11-16\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"5\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2014 5th International Workshop on Data-Intensive Computing in the Clouds\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/DataCloud.2014.8\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2014 5th International Workshop on Data-Intensive Computing in the Clouds","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/DataCloud.2014.8","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 5

Abstract

Use of high-level scripting languages to solve big data problems has become a mainstream approach for sophisticated machine learning data analysis. Often data must be used in several steps of a computation to complete a full task. Composing default data transformation operators with the standard Hadoop MapReduce runtime is very convenient. However, the current strategy of using high-level languages to support iterative applications with Hadoop MapReduce relies on an external wrapper script in another language such as Python or Groovy, which causes significant performance loss when mappers and reducers are restarted between jobs. In this paper, we reduce the extra job startup overheads by integrating Apache Pig with Harp, the high-performance Hadoop plug-in developed at Indiana University. This provides fast data caching and customized communication patterns across iterations for data analysis. The results show performance improvements by factors of 2 to 5.
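
To make the startup overhead concrete, the driver pattern the paper argues against looks roughly like the following minimal sketch (not taken from the paper; the script name, parameter, and iteration count are hypothetical). Every iteration submits a fresh Pig job, so MapReduce task startup and HDFS re-reads are paid again on each pass:

# driver.py - hypothetical external wrapper that resubmits a Pig script per iteration
import subprocess

NUM_ITERATIONS = 10  # assumed iteration count, for illustration only

for i in range(NUM_ITERATIONS):
    # Each call launches a new Pig/MapReduce job, so mappers and reducers are
    # restarted and intermediate data is re-read from HDFS on every pass; this
    # per-job startup cost is what the Pig+Harp integration aims to eliminate
    # via in-memory caching and collective communication between iterations.
    subprocess.run(
        ["pig", "-param", f"iteration={i}", "iterative_step.pig"],
        check=True,
    )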