{"title":"Analysis of TPC-DS: the first standard benchmark for SQL-based big data systems","authors":"Meikel Pöss, T. Rabl, H. Jacobsen","doi":"10.1145/3127479.3128603","DOIUrl":null,"url":null,"abstract":"The advent of Web 2.0 companies, such as Facebook, Google, and Amazon with their insatiable appetite for vast amounts of structured, semi-structured, and unstructured data, triggered the development of Hadoop and related tools, e.g., YARN, MapReduce, and Pig, as well as NoSQL databases. These tools form an open source software stack to support the processing of large and diverse data sets on clustered systems to perform decision support tasks. Recently, SQL is resurrecting in many of these solutions, e.g., Hive, Stinger, Impala, Shark, and Presto. At the same time, RDBMS vendors are adding Hadoop support into their SQL engines, e.g., IBM's Big SQL, Actian's Vortex, Oracle's Big Data SQL, and SAP's HANA. Because there was no industry standard benchmark that could measure the performance of SQL-based big data solutions, marketing claims were mostly based on \"cherry picked\" subsets of the TPC-DS benchmark to suit individual companies strengths, while blending out their weaknesses. In this paper, we present and analyze our work on modifying TPC-DS to fill the void for an industry standard benchmark that is able to measure the performance of SQL-based big data solutions. The new benchmark was ratified by the TPC in early 2016. To show the significance of the new benchmark, we analyze performance data obtained on four different systems running big data, traditional RDBMS, and columnar in-memory architectures.","PeriodicalId":20679,"journal":{"name":"Proceedings of the 2017 Symposium on Cloud Computing","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2017-09-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"36","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2017 Symposium on Cloud Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3127479.3128603","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 36
Abstract
The advent of Web 2.0 companies, such as Facebook, Google, and Amazon with their insatiable appetite for vast amounts of structured, semi-structured, and unstructured data, triggered the development of Hadoop and related tools, e.g., YARN, MapReduce, and Pig, as well as NoSQL databases. These tools form an open source software stack to support the processing of large and diverse data sets on clustered systems to perform decision support tasks. Recently, SQL is resurrecting in many of these solutions, e.g., Hive, Stinger, Impala, Shark, and Presto. At the same time, RDBMS vendors are adding Hadoop support into their SQL engines, e.g., IBM's Big SQL, Actian's Vortex, Oracle's Big Data SQL, and SAP's HANA. Because there was no industry standard benchmark that could measure the performance of SQL-based big data solutions, marketing claims were mostly based on "cherry picked" subsets of the TPC-DS benchmark to suit individual companies strengths, while blending out their weaknesses. In this paper, we present and analyze our work on modifying TPC-DS to fill the void for an industry standard benchmark that is able to measure the performance of SQL-based big data solutions. The new benchmark was ratified by the TPC in early 2016. To show the significance of the new benchmark, we analyze performance data obtained on four different systems running big data, traditional RDBMS, and columnar in-memory architectures.
Web 2.0公司的出现,如Facebook、谷歌和Amazon,他们对大量结构化、半结构化和非结构化数据的贪欲无法满足,引发了Hadoop和相关工具的发展,如YARN、MapReduce和Pig,以及NoSQL数据库。这些工具形成了一个开源软件堆栈,以支持处理集群系统上的大型和不同的数据集,从而执行决策支持任务。最近,SQL在许多这些解决方案中复活,例如Hive、Stinger、Impala、Shark和Presto。与此同时,RDBMS供应商正在将Hadoop支持添加到他们的SQL引擎中,例如IBM的Big SQL、Actian的Vortex、Oracle的Big Data SQL和SAP的HANA。由于没有行业标准基准可以衡量基于sql的大数据解决方案的性能,营销主张主要是基于“精心挑选”的TPC-DS基准子集,以适应个别公司的优势,同时融合他们的弱点。在本文中,我们介绍并分析了我们在修改TPC-DS方面的工作,以填补能够衡量基于sql的大数据解决方案性能的行业标准基准的空白。2016年初,TPC批准了新的基准。为了展示新基准的重要性,我们分析了在运行大数据、传统RDBMS和列式内存架构的四个不同系统上获得的性能数据。