Workload characterization and optimization of TPC-H queries on Apache Spark

Tatsuhiro Chiba, Tamiya Onodera
{"title":"Workload characterization and optimization of TPC-H queries on Apache Spark","authors":"Tatsuhiro Chiba, Tamiya Onodera","doi":"10.1109/ISPASS.2016.7482079","DOIUrl":null,"url":null,"abstract":"Besides being an in-memory-oriented computing framework, Spark runs on top of Java Virtual Machines (JVMs), so JVM parameters must be tuned to improve Spark application performance. Misconfigured parameters and settings degrade performance. For example, using Java heaps that are too large often causes a long garbage collection pause time, which accounts for over 10-20% of application execution time. Moreover, recent computing nodes have many cores with simultaneous multi-threading technology and the processors on the node are connected via NUMA, so it is difficult to exploit best performance without taking into account of these hardware features. Thus, optimization in a full stack is also important. Not only JVM parameters but also OS parameters, Spark configuration, and application code based on CPU characteristics need to be optimized to take full advantage of underlying computing resources. In this paper, we used the TPC-H benchmark as our optimization case study and gathered many perspective logs such as application, JVM (e.g. GC and JIT), system utilization, and hardware events from a performance monitoring unit. We discuss current problems and introduce several JVM and OS parameter optimization approaches for accelerating Spark performance. As a result, our optimization exhibits 30-40% increase in speed on average and is up to 5x faster than the naive configuration.","PeriodicalId":416765,"journal":{"name":"2016 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"103 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-04-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"50","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2016 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ISPASS.2016.7482079","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 50

Abstract

Although Spark is an in-memory-oriented computing framework, it runs on top of Java Virtual Machines (JVMs), so JVM parameters must be tuned to improve Spark application performance. Misconfigured parameters and settings degrade performance; for example, using Java heaps that are too large often causes long garbage collection pauses, which can account for 10-20% or more of application execution time. Moreover, recent computing nodes have many cores with simultaneous multi-threading, and the processors on a node are connected via NUMA, so it is difficult to achieve the best performance without taking these hardware features into account. Thus, full-stack optimization is also important: not only JVM parameters but also OS parameters, Spark configuration, and application code written with CPU characteristics in mind need to be optimized to take full advantage of the underlying computing resources. In this paper, we use the TPC-H benchmark as an optimization case study and gather logs from many perspectives, such as application logs, JVM logs (e.g., GC and JIT), system utilization, and hardware events from the performance monitoring unit. We discuss current problems and introduce several JVM and OS parameter optimization approaches for accelerating Spark performance. As a result, our optimizations yield a 30-40% speedup on average and are up to 5x faster than the naive configuration.
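To make the kind of tuning described above concrete, the sketch below shows how per-executor heap size, core count, and GC-related JVM flags can be set through Spark's standard configuration API. This is a minimal illustration under assumptions, not the authors' actual setup: the property names and HotSpot flags are standard, but every value is chosen purely for illustration, and the OS-level steps the paper also covers (such as NUMA-aware placement of executor processes) are only noted in a comment.

import org.apache.spark.sql.SparkSession

object TpchTuningSketch {
  def main(args: Array[String]): Unit = {
    // Illustrative values only; suitable heap sizes, core counts, and GC settings
    // depend on the machine and workload and must be measured, as the paper does.
    val spark = SparkSession.builder()
      .appName("TPC-H tuning sketch")
      // Keep per-executor heaps moderate: oversized heaps tend to lengthen GC pauses.
      .config("spark.executor.memory", "24g")
      .config("spark.executor.cores", "6")
      // The GC algorithm and its thread count are the kind of JVM parameters being
      // tuned; GC logging makes pause times visible for later analysis.
      .config("spark.executor.extraJavaOptions",
        "-XX:+UseParallelGC -XX:ParallelGCThreads=6 -verbose:gc")
      .getOrCreate()

    // OS-level knobs sit outside Spark itself: for example, executor processes can
    // be pinned to a NUMA node (e.g., by launching workers under numactl) so that
    // threads stay close to the memory they allocate.

    // ... run TPC-H queries here, e.g., spark.sql("SELECT ...") ...
    spark.stop()
  }
}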