Workload characterization and optimization of TPC-H queries on Apache Spark

Tatsuhiro Chiba, Tamiya Onodera
{"title":"Workload characterization and optimization of TPC-H queries on Apache Spark","authors":"Tatsuhiro Chiba, Tamiya Onodera","doi":"10.1109/ISPASS.2016.7482079","DOIUrl":null,"url":null,"abstract":"Besides being an in-memory-oriented computing framework, Spark runs on top of Java Virtual Machines (JVMs), so JVM parameters must be tuned to improve Spark application performance. Misconfigured parameters and settings degrade performance. For example, using Java heaps that are too large often causes a long garbage collection pause time, which accounts for over 10-20% of application execution time. Moreover, recent computing nodes have many cores with simultaneous multi-threading technology and the processors on the node are connected via NUMA, so it is difficult to exploit best performance without taking into account of these hardware features. Thus, optimization in a full stack is also important. Not only JVM parameters but also OS parameters, Spark configuration, and application code based on CPU characteristics need to be optimized to take full advantage of underlying computing resources. In this paper, we used the TPC-H benchmark as our optimization case study and gathered many perspective logs such as application, JVM (e.g. GC and JIT), system utilization, and hardware events from a performance monitoring unit. We discuss current problems and introduce several JVM and OS parameter optimization approaches for accelerating Spark performance. As a result, our optimization exhibits 30-40% increase in speed on average and is up to 5x faster than the naive configuration.","PeriodicalId":416765,"journal":{"name":"2016 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"103 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-04-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"50","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2016 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ISPASS.2016.7482079","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 50

Abstract

Although Spark is an in-memory-oriented computing framework, it runs on top of Java Virtual Machines (JVMs), so JVM parameters must be tuned to improve Spark application performance. Misconfigured parameters and settings degrade performance; for example, using Java heaps that are too large often causes long garbage collection pauses, which can account for 10-20% or more of application execution time. Moreover, recent computing nodes have many cores with simultaneous multi-threading, and the processors on a node are connected via NUMA, so it is difficult to achieve the best performance without taking these hardware features into account. Thus, full-stack optimization is also important: not only JVM parameters but also OS parameters, Spark configuration, and application code written with CPU characteristics in mind need to be optimized to take full advantage of the underlying computing resources. In this paper, we use the TPC-H benchmark as an optimization case study and gather logs from many perspectives, such as application logs, JVM logs (e.g., GC and JIT), system utilization, and hardware events from the performance monitoring unit. We discuss current problems and introduce several JVM and OS parameter optimization approaches for accelerating Spark performance. As a result, our optimizations yield a 30-40% speedup on average and are up to 5x faster than the naive configuration.
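To make the kind of tuning described above concrete, the sketch below shows how per-executor heap size, core count, and GC-related JVM flags can be set through Spark's standard configuration API. This is a minimal illustration under assumptions, not the authors' actual setup: the property names and HotSpot flags are standard, but every value is chosen purely for illustration, and the OS-level steps the paper also covers (such as NUMA-aware placement of executor processes) are only noted in a comment.

import org.apache.spark.sql.SparkSession

object TpchTuningSketch {
  def main(args: Array[String]): Unit = {
    // Illustrative values only; suitable heap sizes, core counts, and GC settings
    // depend on the machine and workload and must be measured, as the paper does.
    val spark = SparkSession.builder()
      .appName("TPC-H tuning sketch")
      // Keep per-executor heaps moderate: oversized heaps tend to lengthen GC pauses.
      .config("spark.executor.memory", "24g")
      .config("spark.executor.cores", "6")
      // The GC algorithm and its thread count are the kind of JVM parameters being
      // tuned; GC logging makes pause times visible for later analysis.
      .config("spark.executor.extraJavaOptions",
        "-XX:+UseParallelGC -XX:ParallelGCThreads=6 -verbose:gc")
      .getOrCreate()

    // OS-level knobs sit outside Spark itself: for example, executor processes can
    // be pinned to a NUMA node (e.g., by launching workers under numactl) so that
    // threads stay close to the memory they allocate.

    // ... run TPC-H queries here, e.g., spark.sql("SELECT ...") ...
    spark.stop()
  }
}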