Code Generation in Serializers and Comparators of Apache Flink

G. Horváth, Norbert Pataki, Márton Balassi
{"title":"Code Generation in Serializers and Comparators of Apache Flink","authors":"G. Horváth, Norbert Pataki, Márton Balassi","doi":"10.1145/3098572.3098579","DOIUrl":null,"url":null,"abstract":"There is a shift in the Big Data world. Applications used to be I/O bound. InfiniBand, SSDs reduced the I/O overhead and more sophisticated algorithms were developed. CPU became a bottleneck for some applications. Using state of the art CPUs, reduced CPU usage can lead to reduced electricity costs even when an application is I/O bound. Apache Flink is an open source framework for processing streams of data and batch jobs. It is using serialization for wide variety of purposes. Not only for sending data over the network, saving it to the hard disk, or for fault tolerance, but also some of the operators can work on the serialized representation of the data instead of Java objects. This approach can improve the performance significantly. Flink has a custom serialization method that enables operators to work on the serialized formats. Currently, Apache Flink uses reflection to serialize Plain Old Java Objects (POJOs). Reflection in Java is notoriously slow. Moreover, the structure of the code is harder to optimize for the JIT compiler. As a Google Summer of Code project in 2016, we implemented code generation for serializers and comparators for POJOs to improve the performance of Apache Flink. Flink has a delicate type system which provides us with lots of information about the types that need to be serialized. Using this information it is possible to generate specialized code with great performance. We achieved more than 6X performance improvement in the serialization which was a 20% overall improvement.","PeriodicalId":368815,"journal":{"name":"Proceedings of the 12th Workshop on Implementation, Compilation, Optimization of Object-Oriented Languages, Programs and Systems","volume":"19 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 12th Workshop on Implementation, Compilation, Optimization of Object-Oriented Languages, Programs and Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3098572.3098579","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 4

Abstract

There is a shift in the Big Data world. Applications used to be I/O bound, but InfiniBand and SSDs have reduced the I/O overhead and more sophisticated algorithms have been developed, so the CPU has become a bottleneck for some applications. Even with state-of-the-art CPUs, reducing CPU usage can lower electricity costs, even when an application is I/O bound. Apache Flink is an open source framework for processing streams of data and batch jobs. It uses serialization for a wide variety of purposes: not only for sending data over the network, saving it to disk, and providing fault tolerance, but also because some operators can work on the serialized representation of the data instead of on Java objects, which can improve performance significantly. Flink has a custom serialization method that enables operators to work on the serialized format. Currently, Apache Flink uses reflection to serialize Plain Old Java Objects (POJOs). Reflection in Java is notoriously slow, and the resulting code structure is harder for the JIT compiler to optimize. As a Google Summer of Code project in 2016, we implemented code generation for POJO serializers and comparators to improve the performance of Apache Flink. Flink has a sophisticated type system that provides a lot of information about the types that need to be serialized, and with this information it is possible to generate specialized code with excellent performance. We achieved a more than 6x performance improvement in serialization, which translated to a 20% overall improvement.
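
To make the contrast between reflection-based and generated serializers concrete, the sketch below compares the two approaches for a simple POJO. It is a minimal illustration assuming a hypothetical Point class with two int fields; the class, method names, and structure are invented for this example and do not reflect Flink's actual serializer interfaces.

```java
import java.io.DataOutput;
import java.io.IOException;
import java.lang.reflect.Field;

// Hypothetical POJO used only for illustration.
class Point {
    public int x;
    public int y;
}

public class SerializerSketch {

    // Reflection-based approach: fields are discovered and accessed at
    // runtime on every call, which adds per-field overhead and produces
    // code that is hard for the JIT compiler to inline and optimize.
    static void writeWithReflection(Object pojo, DataOutput out)
            throws IllegalAccessException, IOException {
        for (Field field : pojo.getClass().getDeclaredFields()) {
            field.setAccessible(true);
            out.writeInt(field.getInt(pojo)); // assumes every field is an int
        }
    }

    // What a generated serializer for Point boils down to: straight-line
    // field accesses with no reflection, which the JIT can compile to
    // efficient machine code.
    static void writeGenerated(Point p, DataOutput out) throws IOException {
        out.writeInt(p.x);
        out.writeInt(p.y);
    }
}
```

The generated variant stands in for the kind of per-type code a code generator can emit once the field layout is known from the framework's type information, while the reflective variant illustrates the per-call lookup cost that such generation avoids.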