大规模数据分析方法的比较

Proceedings of the 2009 ACM SIGMOD International Conference on Management of data Pub Date : 2009-06-29 DOI:10.1145/1559845.1559865

Andrew Pavlo, Erik Paulson, A. Rasin, D. Abadi, D. DeWitt, S. Madden, M. Stonebraker

{"title":"大规模数据分析方法的比较","authors":"Andrew Pavlo, Erik Paulson, A. Rasin, D. Abadi, D. DeWitt, S. Madden, M. Stonebraker","doi":"10.1145/1559845.1559865","DOIUrl":null,"url":null,"abstract":"There is currently considerable enthusiasm around the MapReduce (MR) paradigm for large-scale data analysis [17]. Although the basic control flow of this framework has existed in parallel SQL database management systems (DBMS) for over 20 years, some have called MR a dramatically new computing model [8, 17]. In this paper, we describe and compare both paradigms. Furthermore, we evaluate both kinds of systems in terms of performance and development complexity. To this end, we define a benchmark consisting of a collection of tasks that we have run on an open source version of MR as well as on two parallel DBMSs. For each task, we measure each system's performance for various degrees of parallelism on a cluster of 100 nodes. Our results reveal some interesting trade-offs. Although the process to load data into and tune the execution of parallel DBMSs took much longer than the MR system, the observed performance of these DBMSs was strikingly better. We speculate about the causes of the dramatic performance difference and consider implementation concepts that future systems should take from both kinds of architectures.","PeriodicalId":344093,"journal":{"name":"Proceedings of the 2009 ACM SIGMOD International Conference on Management of data","volume":"16 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2009-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1244","resultStr":"{\"title\":\"A comparison of approaches to large-scale data analysis\",\"authors\":\"Andrew Pavlo, Erik Paulson, A. Rasin, D. Abadi, D. DeWitt, S. Madden, M. Stonebraker\",\"doi\":\"10.1145/1559845.1559865\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"There is currently considerable enthusiasm around the MapReduce (MR) paradigm for large-scale data analysis [17]. Although the basic control flow of this framework has existed in parallel SQL database management systems (DBMS) for over 20 years, some have called MR a dramatically new computing model [8, 17]. In this paper, we describe and compare both paradigms. Furthermore, we evaluate both kinds of systems in terms of performance and development complexity. To this end, we define a benchmark consisting of a collection of tasks that we have run on an open source version of MR as well as on two parallel DBMSs. For each task, we measure each system's performance for various degrees of parallelism on a cluster of 100 nodes. Our results reveal some interesting trade-offs. Although the process to load data into and tune the execution of parallel DBMSs took much longer than the MR system, the observed performance of these DBMSs was strikingly better. We speculate about the causes of the dramatic performance difference and consider implementation concepts that future systems should take from both kinds of architectures.\",\"PeriodicalId\":344093,\"journal\":{\"name\":\"Proceedings of the 2009 ACM SIGMOD International Conference on Management of data\",\"volume\":\"16 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2009-06-29\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1244\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 2009 ACM SIGMOD International Conference on Management of data\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/1559845.1559865\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2009 ACM SIGMOD International Conference on Management of data","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/1559845.1559865","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 1244

摘要

目前，人们对大规模数据分析的MapReduce (MR)范式有着相当大的热情[17]。尽管该框架的基本控制流在并行SQL数据库管理系统(DBMS)中已经存在了20多年，但一些人称MR为一种全新的计算模型[8,17]。在本文中，我们描述并比较了这两种范式。此外，我们根据性能和开发复杂性来评估这两种系统。为此，我们定义了一个由一组任务组成的基准，我们在开放源码版本的MR和两个并行dbms上运行了这些任务。对于每个任务，我们在100个节点的集群上测量每个系统的不同并行度的性能。我们的结果揭示了一些有趣的权衡。尽管将数据加载到并行dbms并调优执行的过程比MR系统花费的时间要长得多，但观察到这些dbms的性能明显更好。我们推测了巨大的性能差异的原因，并考虑了未来系统应该从这两种体系结构中汲取的实现概念。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

A comparison of approaches to large-scale data analysis

There is currently considerable enthusiasm around the MapReduce (MR) paradigm for large-scale data analysis [17]. Although the basic control flow of this framework has existed in parallel SQL database management systems (DBMS) for over 20 years, some have called MR a dramatically new computing model [8, 17]. In this paper, we describe and compare both paradigms. Furthermore, we evaluate both kinds of systems in terms of performance and development complexity. To this end, we define a benchmark consisting of a collection of tasks that we have run on an open source version of MR as well as on two parallel DBMSs. For each task, we measure each system's performance for various degrees of parallelism on a cluster of 100 nodes. Our results reveal some interesting trade-offs. Although the process to load data into and tune the execution of parallel DBMSs took much longer than the MR system, the observed performance of these DBMSs was strikingly better. We speculate about the causes of the dramatic performance difference and consider implementation concepts that future systems should take from both kinds of architectures.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings of the 2009 ACM SIGMOD International Conference on Management of data

自引率

0.00%

发文量