BigBench:迈向大数据分析的行业标准基准

Proceedings. ACM-SIGMOD International Conference on Management of Data Pub Date : 2013-06-22 DOI:10.1145/2463676.2463712

A. Ghazal, T. Rabl, Minqing Hu, Francois Raab, Meikel Poess, A. Crolotte, H. Jacobsen

{"title":"BigBench:迈向大数据分析的行业标准基准","authors":"A. Ghazal, T. Rabl, Minqing Hu, Francois Raab, Meikel Poess, A. Crolotte, H. Jacobsen","doi":"10.1145/2463676.2463712","DOIUrl":null,"url":null,"abstract":"There is a tremendous interest in big data by academia, industry and a large user base. Several commercial and open source providers unleashed a variety of products to support big data storage and processing. As these products mature, there is a need to evaluate and compare the performance of these systems.\n In this paper, we present BigBench, an end-to-end big data benchmark proposal. The underlying business model of BigBench is a product retailer. The proposal covers a data model and synthetic data generator that addresses the variety, velocity and volume aspects of big data systems containing structured, semi-structured and unstructured data. The structured part of the BigBench data model is adopted from the TPC-DS benchmark, which is enriched with semi-structured and unstructured data components. The semi-structured part captures registered and guest user clicks on the retailer's website. The unstructured data captures product reviews submitted online. The data generator designed for BigBench provides scalable volumes of raw data based on a scale factor. The BigBench workload is designed around a set of queries against the data model. From a business prospective, the queries cover the different categories of big data analytics proposed by McKinsey. From a technical prospective, the queries are designed to span three different dimensions based on data sources, query processing types and analytic techniques.\n We illustrate the feasibility of BigBench by implementing it on the Teradata Aster Database. The test includes generating and loading a 200 Gigabyte BigBench data set and testing the workload by executing the BigBench queries (written using Teradata Aster SQL-MR) and reporting their response times.","PeriodicalId":87344,"journal":{"name":"Proceedings. ACM-SIGMOD International Conference on Management of Data","volume":"44 1","pages":"1197-1208"},"PeriodicalIF":0.0000,"publicationDate":"2013-06-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"373","resultStr":"{\"title\":\"BigBench: towards an industry standard benchmark for big data analytics\",\"authors\":\"A. Ghazal, T. Rabl, Minqing Hu, Francois Raab, Meikel Poess, A. Crolotte, H. Jacobsen\",\"doi\":\"10.1145/2463676.2463712\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"There is a tremendous interest in big data by academia, industry and a large user base. Several commercial and open source providers unleashed a variety of products to support big data storage and processing. As these products mature, there is a need to evaluate and compare the performance of these systems.\\n In this paper, we present BigBench, an end-to-end big data benchmark proposal. The underlying business model of BigBench is a product retailer. The proposal covers a data model and synthetic data generator that addresses the variety, velocity and volume aspects of big data systems containing structured, semi-structured and unstructured data. The structured part of the BigBench data model is adopted from the TPC-DS benchmark, which is enriched with semi-structured and unstructured data components. The semi-structured part captures registered and guest user clicks on the retailer's website. The unstructured data captures product reviews submitted online. The data generator designed for BigBench provides scalable volumes of raw data based on a scale factor. The BigBench workload is designed around a set of queries against the data model. From a business prospective, the queries cover the different categories of big data analytics proposed by McKinsey. From a technical prospective, the queries are designed to span three different dimensions based on data sources, query processing types and analytic techniques.\\n We illustrate the feasibility of BigBench by implementing it on the Teradata Aster Database. The test includes generating and loading a 200 Gigabyte BigBench data set and testing the workload by executing the BigBench queries (written using Teradata Aster SQL-MR) and reporting their response times.\",\"PeriodicalId\":87344,\"journal\":{\"name\":\"Proceedings. ACM-SIGMOD International Conference on Management of Data\",\"volume\":\"44 1\",\"pages\":\"1197-1208\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2013-06-22\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"373\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings. ACM-SIGMOD International Conference on Management of Data\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/2463676.2463712\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings. ACM-SIGMOD International Conference on Management of Data","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2463676.2463712","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 373

摘要

学术界、工业界和大量用户对大数据有着极大的兴趣。一些商业和开源提供商发布了各种产品来支持大数据存储和处理。随着这些产品的成熟，有必要评估和比较这些系统的性能。在本文中，我们提出了BigBench，一个端到端的大数据基准提案。BigBench的基本商业模式是产品零售商。该提案涵盖了一个数据模型和合成数据生成器，解决了包含结构化、半结构化和非结构化数据的大数据系统的多样性、速度和数量方面的问题。BigBench数据模型的结构化部分采用了TPC-DS基准，其中丰富了半结构化和非结构化数据组件。半结构化部分捕获零售商网站上的注册用户和访客用户点击。非结构化数据捕获在线提交的产品评论。为BigBench设计的数据生成器基于比例因子提供可扩展的原始数据量。BigBench工作负载是围绕一组针对数据模型的查询设计的。从商业角度来看，这些查询涵盖了麦肯锡提出的不同类别的大数据分析。从技术角度来看，基于数据源、查询处理类型和分析技术，查询被设计为跨越三个不同的维度。我们通过在Teradata Aster数据库上实现BigBench来说明其可行性。该测试包括生成和加载一个200gb的BigBench数据集，并通过执行BigBench查询(使用Teradata Aster SQL-MR编写)和报告其响应时间来测试工作负载。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

BigBench: towards an industry standard benchmark for big data analytics

There is a tremendous interest in big data by academia, industry and a large user base. Several commercial and open source providers unleashed a variety of products to support big data storage and processing. As these products mature, there is a need to evaluate and compare the performance of these systems. In this paper, we present BigBench, an end-to-end big data benchmark proposal. The underlying business model of BigBench is a product retailer. The proposal covers a data model and synthetic data generator that addresses the variety, velocity and volume aspects of big data systems containing structured, semi-structured and unstructured data. The structured part of the BigBench data model is adopted from the TPC-DS benchmark, which is enriched with semi-structured and unstructured data components. The semi-structured part captures registered and guest user clicks on the retailer's website. The unstructured data captures product reviews submitted online. The data generator designed for BigBench provides scalable volumes of raw data based on a scale factor. The BigBench workload is designed around a set of queries against the data model. From a business prospective, the queries cover the different categories of big data analytics proposed by McKinsey. From a technical prospective, the queries are designed to span three different dimensions based on data sources, query processing types and analytic techniques. We illustrate the feasibility of BigBench by implementing it on the Teradata Aster Database. The test includes generating and loading a 200 Gigabyte BigBench data set and testing the workload by executing the BigBench queries (written using Teradata Aster SQL-MR) and reporting their response times.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings. ACM-SIGMOD International Conference on Management of Data

自引率

0.00%

发文量