Processing Java UDFs in a C++ environment
Viktor Rosenfeld, René Müller, Pınar Tözün, Fatma Özcan
Proceedings of the 2017 Symposium on Cloud Computing, September 24, 2017. DOI: 10.1145/3127479.3132022
Citations: 11
Abstract
Many popular big data analytics systems today make liberal use of user-defined functions (UDFs) in their programming interface and are written in languages based on the Java Virtual Machine (JVM). This combination creates a barrier when we want to integrate processing engines written in a language that compiles down to machine code with a JVM-based big data analytics ecosystem. In this paper, we investigate efficient ways of executing UDFs written in Java inside a data processing engine written in C++. While it is possible to call Java code from machine code via the Java Native Interface (JNI), a naive implementation that applies the UDF one row at a time incurs a significant overhead, up to an order of magnitude. Instead, we can significantly reduce the costs of JNI calls and data copies between Java and machine code, if we execute UDFs on batches of rows, and reuse input/output buffers when possible. Our evaluation of these techniques using different scalar UDFs, in a prototype system that combines Spark and a columnar data processing engine written in C++, shows that such a combination does not slow down the execution of SparkSQL queries containing such UDFs. In fact, we find that the execution of Java UDFs inside an embedded JVM in our C++ engine is 1.12X to 1.53X faster than executing in Spark alone. Our analysis also shows that compiling Java UDFs directly into machine code is not always beneficial over strided execution in the JVM.
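The batching idea described in the abstract can be illustrated with a small JNI sketch (the class name udf/BatchedUdf, its apply method, and the batch size are hypothetical, not the authors' actual interface): instead of crossing the JNI boundary once per row, the C++ engine wraps a reusable native input buffer and a reusable output buffer as direct ByteBuffers and hands an entire batch of rows to a single Java call that applies the UDF to every row.

```cpp
// Minimal sketch of batched UDF invocation over JNI, assuming a Java class
// "udf/BatchedUdf" with a static method "void apply(ByteBuffer in, ByteBuffer out, int rows)".
// Error handling and realistic JVM startup options are omitted for brevity.
#include <jni.h>
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    // Start an embedded JVM inside the C++ process.
    JavaVM* jvm = nullptr;
    JNIEnv* env = nullptr;
    JavaVMOption options[1];
    options[0].optionString = const_cast<char*>("-Djava.class.path=.");
    JavaVMInitArgs vmArgs{};
    vmArgs.version = JNI_VERSION_1_8;
    vmArgs.nOptions = 1;
    vmArgs.options = options;
    vmArgs.ignoreUnrecognized = JNI_FALSE;
    if (JNI_CreateJavaVM(&jvm, reinterpret_cast<void**>(&env), &vmArgs) != JNI_OK) {
        std::fprintf(stderr, "failed to create embedded JVM\n");
        return 1;
    }

    // Resolve the (hypothetical) batched UDF entry point once, outside the data loop.
    jclass cls = env->FindClass("udf/BatchedUdf");
    jmethodID apply = env->GetStaticMethodID(
        cls, "apply", "(Ljava/nio/ByteBuffer;Ljava/nio/ByteBuffer;I)V");

    // Reusable buffers: allocated once, wrapped as direct ByteBuffers once,
    // and refilled for every batch, so there are no per-row JNI calls or copies.
    constexpr int kBatchRows = 4096;
    std::vector<int64_t> inBuf(kBatchRows), outBuf(kBatchRows);
    jobject jIn  = env->NewDirectByteBuffer(inBuf.data(),  inBuf.size()  * sizeof(int64_t));
    jobject jOut = env->NewDirectByteBuffer(outBuf.data(), outBuf.size() * sizeof(int64_t));

    // One JNI transition per batch instead of one per row.
    for (int batch = 0; batch < 10; ++batch) {
        for (int i = 0; i < kBatchRows; ++i) {
            inBuf[i] = static_cast<int64_t>(batch) * kBatchRows + i;  // fill the input batch
        }
        env->CallStaticVoidMethod(cls, apply, jIn, jOut, kBatchRows);
        // outBuf now holds the UDF results for this batch; the engine would consume them here.
    }

    jvm->DestroyJavaVM();
    return 0;
}
```

The key point of the sketch is that the expensive operations (class and method lookup, buffer wrapping) happen once, and each batch costs a single JNI call over memory the JVM and the C++ engine share, which is the source of the savings the paper reports over row-at-a-time invocation.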