INSTalytics

ACM Transactions on Storage (TOS) Pub Date: 2020-01-16 DOI: 10.1145/3369738

Muthian Sivathanu, Midhul Vuppalapati, Bhargav S. Gulavani, K. Rajan, Jyoti Leeka, Jayashree Mohan, Piyus Kedia

Citations: 7

Abstract

We present the design, implementation, and evaluation of INSTalytics, a co-designed stack comprising a cluster file system and a compute layer, for efficient big-data analytics in large-scale data centers. INSTalytics amplifies the well-known benefits of data partitioning in analytics systems: instead of traditional partitioning on one dimension, INSTalytics enables data to be simultaneously partitioned on four different dimensions at the same storage cost, so that a larger fraction of queries can benefit from partition filtering and from joins without network shuffle. To achieve this, INSTalytics uses compute-awareness to customize the three-way replication that the cluster file system employs for availability. A new heterogeneous replication layout enables INSTalytics to preserve the same recovery cost and availability as traditional replication. INSTalytics also uses compute-awareness to expose a new sliced-read API that improves the performance of joins by enabling multiple compute nodes to read slices of a data block efficiently via coordinated request scheduling and selective caching at the storage nodes. We have built a prototype implementation of INSTalytics in a production analytics stack, and we show that its recovery performance and availability are similar to those of physical replication, while it provides significant improvements in query performance, suggesting a new approach to designing cloud-scale big-data analytics systems.
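The idea of partitioning each replica differently can be illustrated with a small in-memory sketch. This is not the INSTalytics implementation: the class name `HeterogeneousReplicas`, the helper functions, the hash-based bucketing, and the sample columns are all assumptions made for illustration. The sketch keeps three full copies of the same rows, each partitioned on a different column, and it does not attempt to show how a fourth dimension is obtained or how recovery works, since the abstract does not describe those mechanisms.

```python
import zlib
from collections import defaultdict


def bucket(value, num_partitions):
    """Map a value to a partition id with a hash that is stable across runs."""
    return zlib.crc32(str(value).encode("utf-8")) % num_partitions


def partition_by(rows, column, num_partitions):
    """Hash-partition rows on one column; returns {partition_id: [rows]}."""
    parts = defaultdict(list)
    for row in rows:
        parts[bucket(row[column], num_partitions)].append(row)
    return dict(parts)


class HeterogeneousReplicas:
    """Hypothetical model: one full copy of the data per partitioning column."""

    def __init__(self, rows, columns, num_partitions):
        self.num_partitions = num_partitions
        # Every copy holds all rows, so storage equals plain N-way replication.
        self.copies = {
            col: partition_by(rows, col, num_partitions) for col in columns
        }

    def lookup(self, column, value):
        """Return (matching rows, number of rows scanned) for column == value."""
        if column in self.copies:
            # Partition filtering: read only the one partition that can match.
            part_id = bucket(value, self.num_partitions)
            candidates = self.copies[column].get(part_id, [])
        else:
            # No copy is partitioned on this column: scan one whole copy.
            any_copy = next(iter(self.copies.values()))
            candidates = [r for part in any_copy.values() for r in part]
        matches = [r for r in candidates if r[column] == value]
        return matches, len(candidates)


# Sample data with made-up columns (assumption, not from the paper).
rows = [
    {"user": i % 10, "region": i % 4, "day": i % 7, "amount": i}
    for i in range(280)
]
replicas = HeterogeneousReplicas(rows, ["user", "region", "day"], num_partitions=4)
```

In this sketch, `replicas.lookup("user", 3)` reads only the single partition of the copy partitioned on `user`, whereas `replicas.lookup("amount", 5)` has to scan a whole copy because no copy is partitioned on `amount`. With a single partitioning dimension, only queries filtering on that one column would avoid the full scan.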

Source journal

ACM Transactions on Storage (TOS)

Self-citation rate

0.00%

Number of articles published