Accelerating Spark Datasets by Inlining Deserialization

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS). Pub Date: 2017-05-01. DOI: 10.1109/IPDPS.2017.111

Jan Wroblewski, K. Ishizaki, H. Inoue, Moriyoshi Ohara


Citations: 1

Abstract

Apache Spark is a framework for distributed computing that supports the map-reduce programming model. The SQL module of Spark contains Datasets, i.e., distributed collections of records stored in a serialized low-level format in a manually managed chunk of memory. However, the functions users provide to the map-reduce computations expect Java objects. Datasets perform an additional deserialization step beforehand to support the user-provided function, which increases the overhead. We tackled this problem by replacing map functions with their counterparts that accepted the serialized data. This allowed us to skip the unnecessary part of deserialization and achieve faster data processing speeds.
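The idea in the abstract can be illustrated with a small, self-contained sketch. It is not the authors' implementation and does not use Spark: the record layout, the `Order` class, and all function names below are hypothetical, chosen only to contrast "deserialize, then call the user function" with "call a counterpart that reads the serialized bytes directly". Spark's real row format is not modelled here.

```python
import struct
from dataclasses import dataclass

# Hypothetical fixed-width layout: int64 id, float64 price, int32 quantity.
# "<" means little-endian with no padding, so each record is 20 bytes.
RECORD = struct.Struct("<qdi")

@dataclass
class Order:
    id: int
    price: float
    quantity: int

def serialize(orders):
    """Pack records back to back into one contiguous buffer."""
    buf = bytearray(RECORD.size * len(orders))
    for i, order in enumerate(orders):
        RECORD.pack_into(buf, i * RECORD.size,
                         order.id, order.price, order.quantity)
    return bytes(buf)

def total(order):
    """User-provided map function; it expects a full object."""
    return order.price * order.quantity

def map_with_deserialization(buf):
    """Baseline: rebuild every object, then call the user function."""
    results = []
    for offset in range(0, len(buf), RECORD.size):
        order = Order(*RECORD.unpack_from(buf, offset))
        results.append(total(order))
    return results

# Counterpart of `total` that accepts serialized data. It reads only the
# two fields the computation needs and never builds an Order object.
PRICE_AND_QUANTITY = struct.Struct("<di")
PRICE_OFFSET = 8  # skip the 8-byte id at the start of each record

def map_inlined(buf):
    """Inlined variant: compute directly from the serialized bytes."""
    results = []
    for offset in range(0, len(buf), RECORD.size):
        price, quantity = PRICE_AND_QUANTITY.unpack_from(
            buf, offset + PRICE_OFFSET)
        results.append(price * quantity)
    return results
```

Calling `map_with_deserialization(buf)` and `map_inlined(buf)` on the same buffer from `serialize(...)` gives the same results; the second path simply avoids allocating one object per record and skips the unused `id` field, which is the kind of work the abstract calls the unnecessary part of deserialization.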


Source journal

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

Self-citation rate

0.00%

Publication count