Stocator: an object store aware connector for Apache Spark

G. Vernik, M. Factor, E. K. Kolodner, Effi Ofer, P. Michiardi, Francesco Pace
DOI: 10.1145/3127479.3134761
Published in: Proceedings of the 2017 Symposium on Cloud Computing (2017-09-24)
Citations: 2

Abstract

Data is the natural resource of the 21st century. It is being produced at dizzying rates, e.g., for genomics, for media and entertainment, and for the Internet of Things. Object storage systems such as Amazon S3, Azure Blob storage, and IBM Cloud Object Storage are highly scalable distributed storage systems that offer high-capacity, cost-effective storage. But it is not enough just to store data; we also need to derive value from it. Apache Spark is the leading big data analytics processing engine, combining MapReduce, SQL, streaming, and complex analytics. We present Stocator, a high-performance storage connector that enables Spark to work directly on data stored in object storage systems, while providing the same correctness guarantees as Hadoop's original storage system, HDFS. Current object storage connectors from the Hadoop community, e.g., for the S3 and Swift APIs, do not deal well with eventual consistency, which can lead to failure. These connectors assume file system semantics, which is natural given that their model of operation is based on interaction with HDFS. In particular, Spark and Hadoop achieve fault tolerance and enable speculative execution by creating temporary files, listing directories to identify these files, and then renaming them. This paradigm avoids interference between tasks doing the same work and thus writing output with the same name. However, with eventually consistent object storage, a container listing may not yet include a recently created object; such an object will not be renamed, leading to incomplete or incorrect results. Solutions such as EMRFS [1] from Amazon, S3mper [4] from Netflix, and S3Guard [2] attempt to overcome eventual consistency by requiring an additional strongly consistent data store. These solutions require multiple storage systems, are costly, and can introduce issues of consistency between the stores.
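The failure mode described above can be sketched with a toy model. The code below is illustrative only (none of these classes or names come from Hadoop or Spark): it models a store whose listings lag behind writes, and a file-system-style commit that lists temporary outputs and then "renames" them via copy-and-delete. When the listing has not caught up, the commit silently misses an object.

```python
# Illustrative sketch, not actual Hadoop code: a toy eventually
# consistent object store and a rename-based commit protocol.

class EventuallyConsistentStore:
    """Toy object store whose listings may lag behind recent writes."""
    def __init__(self):
        self.objects = {}        # name -> data (reads are consistent)
        self.listable = set()    # names visible to LIST (may lag)

    def put(self, name, data):
        self.objects[name] = data
        # In a real eventually consistent store, the name appears in
        # listings only after an unbounded delay; here it never appears
        # until settle() is called.

    def settle(self):
        """Simulate the listing eventually catching up."""
        self.listable = set(self.objects)

    def list(self, prefix):
        return [n for n in self.listable if n.startswith(prefix)]

def rename_commit(store, job_prefix):
    """File-system-style commit: list temporary outputs, then rename.

    Object stores have no atomic rename, so each rename is a copy
    followed by a delete.
    """
    committed = []
    for tmp in store.list(job_prefix + "_temporary/"):
        final = tmp.replace("_temporary/", "")
        store.put(final, store.objects[tmp])   # copy
        del store.objects[tmp]                 # delete
        committed.append(final)
    return committed

store = EventuallyConsistentStore()
store.put("out/_temporary/part-00000", b"data")
# The listing has not caught up, so the commit silently misses the
# task's output and produces an incomplete result:
assert rename_commit(store, "out/") == []
# Only once the listing settles does the commit find the object:
store.settle()
assert rename_commit(store, "out/") == ["out/part-00000"]
```

The point of the sketch is that the commit step has no way to distinguish "no output was written" from "the listing has not converged yet", which is exactly why the connectors cited above need an extra strongly consistent store to track object names.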
Current object storage connectors from the Hadoop community are also notorious for their poor performance on write workloads. This, too, stems from their use of the rename operation, which is not a native object storage operation; not only is it not atomic, it must be implemented with a costly copy operation followed by a delete. Others have tried to improve the performance of object storage connectors by eliminating rename, e.g., the DirectParquetOutputCommitter [5] for S3a introduced by Databricks, but have failed to preserve fault tolerance and speculative execution. Stocator takes advantage of object storage semantics to achieve both high performance and fault tolerance. It eliminates the rename paradigm by writing each output object directly to its final name. The name includes both the part number and the attempt number, so that multiple attempts to write the same part use different objects. Stocator extends the already existing success indicator object written at the end of a Spark job to include a manifest with the names of all the objects that compose the final output; this ensures that a subsequent job will correctly read the output without resorting to a list operation whose results may not be consistent. By leveraging the inherent atomicity of object creation and using a manifest, we obtain fault tolerance and enable speculative execution; by avoiding the rename paradigm, we greatly decrease the complexity of the connector and the number of operations on the object storage. We have implemented our connector and shared it in open source [3]. We have compared its performance with the S3a and Hadoop Swift connectors over a range of workloads and found that it executes many fewer operations on the object storage, in some cases as few as one thirtieth.
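The write path described above can be sketched as follows. This is a hypothetical illustration of the scheme, not Stocator's actual code: the name format, the `commit_job` helper, and the dict standing in for an object store are all assumptions made for the example. Each attempt writes straight to a unique final name embedding the part and attempt numbers, and the success marker carries a manifest of the winning objects, so readers never need a listing.

```python
# Hypothetical sketch of the rename-free write path: final, unique
# object names plus a manifest in the job's success indicator object.
import json

def output_name(base, part, attempt):
    # Embed part and attempt numbers in the final name, so racing
    # attempts write distinct objects. Format is illustrative only.
    return f"{base}/part-{part:05d}-attempt-{attempt}"

def commit_job(store, base, winners):
    """Write the success object with a manifest of the final outputs.

    `winners` maps part number -> attempt number of the successful
    attempt for that part (losing speculative attempts are ignored).
    """
    manifest = [output_name(base, p, a) for p, a in sorted(winners.items())]
    store[f"{base}/_SUCCESS"] = json.dumps({"manifest": manifest})
    return manifest

store = {}  # a plain dict stands in for the object store
# Two attempts race on part 0 (speculative execution); both write
# distinct objects, so neither interferes with the other.
store[output_name("out", 0, 0)] = b"slow attempt"
store[output_name("out", 0, 1)] = b"fast attempt"
# Committing records only the winning attempt in the manifest:
manifest = commit_job(store, "out", {0: 1})
assert manifest == ["out/part-00000-attempt-1"]
```

A subsequent reader opens the success object and reads exactly the objects named in the manifest, so a stale listing can neither hide an output nor surface a losing attempt; the losing attempt's object can be garbage-collected at leisure.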
Since the price for an object storage service typically includes charges based on the number of operations executed, this reduction in operations lowers costs for clients, in addition to reducing the load on client software. It also reduces costs and load for the object storage provider, since it can serve more clients with the same amount of processing power. Stocator also substantially increases performance for Spark workloads running over object storage, especially for write-intensive workloads, where it is as much as 18 times faster.