Code Generation in Serializers and Comparators of Apache Flink

G. Horváth, Norbert Pataki, Márton Balassi
{"title":"Code Generation in Serializers and Comparators of Apache Flink","authors":"G. Horváth, Norbert Pataki, Márton Balassi","doi":"10.1145/3098572.3098579","DOIUrl":null,"url":null,"abstract":"There is a shift in the Big Data world. Applications used to be I/O bound. InfiniBand, SSDs reduced the I/O overhead and more sophisticated algorithms were developed. CPU became a bottleneck for some applications. Using state of the art CPUs, reduced CPU usage can lead to reduced electricity costs even when an application is I/O bound. Apache Flink is an open source framework for processing streams of data and batch jobs. It is using serialization for wide variety of purposes. Not only for sending data over the network, saving it to the hard disk, or for fault tolerance, but also some of the operators can work on the serialized representation of the data instead of Java objects. This approach can improve the performance significantly. Flink has a custom serialization method that enables operators to work on the serialized formats. Currently, Apache Flink uses reflection to serialize Plain Old Java Objects (POJOs). Reflection in Java is notoriously slow. Moreover, the structure of the code is harder to optimize for the JIT compiler. As a Google Summer of Code project in 2016, we implemented code generation for serializers and comparators for POJOs to improve the performance of Apache Flink. Flink has a delicate type system which provides us with lots of information about the types that need to be serialized. Using this information it is possible to generate specialized code with great performance. We achieved more than 6X performance improvement in the serialization which was a 20% overall improvement.","PeriodicalId":368815,"journal":{"name":"Proceedings of the 12th Workshop on Implementation, Compilation, Optimization of Object-Oriented Languages, Programs and Systems","volume":"19 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 12th Workshop on Implementation, Compilation, Optimization of Object-Oriented Languages, Programs and Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3098572.3098579","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 4

Abstract

There is a shift in the Big Data world. Applications used to be I/O bound, but InfiniBand and SSDs have reduced the I/O overhead and more sophisticated algorithms have been developed, so the CPU has become a bottleneck for some applications. Even with state-of-the-art CPUs, reducing CPU usage can lower electricity costs, even when an application is I/O bound. Apache Flink is an open source framework for processing streams of data and batch jobs. It uses serialization for a wide variety of purposes: not only for sending data over the network, saving it to disk, and providing fault tolerance, but also because some operators can work on the serialized representation of the data instead of on Java objects, which can improve performance significantly. Flink has a custom serialization method that enables operators to work on the serialized format. Currently, Apache Flink uses reflection to serialize Plain Old Java Objects (POJOs). Reflection in Java is notoriously slow, and the resulting code structure is harder for the JIT compiler to optimize. As a Google Summer of Code project in 2016, we implemented code generation for POJO serializers and comparators to improve the performance of Apache Flink. Flink has a sophisticated type system that provides a lot of information about the types that need to be serialized, and with this information it is possible to generate specialized code with excellent performance. We achieved a more than 6x performance improvement in serialization, which translated to a 20% overall improvement.
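
To make the contrast between reflection-based and generated serializers concrete, the sketch below compares the two approaches for a simple POJO. It is a minimal illustration assuming a hypothetical Point class with two int fields; the class, method names, and structure are invented for this example and do not reflect Flink's actual serializer interfaces.

```java
import java.io.DataOutput;
import java.io.IOException;
import java.lang.reflect.Field;

// Hypothetical POJO used only for illustration.
class Point {
    public int x;
    public int y;
}

public class SerializerSketch {

    // Reflection-based approach: fields are discovered and accessed at
    // runtime on every call, which adds per-field overhead and produces
    // code that is hard for the JIT compiler to inline and optimize.
    static void writeWithReflection(Object pojo, DataOutput out)
            throws IllegalAccessException, IOException {
        for (Field field : pojo.getClass().getDeclaredFields()) {
            field.setAccessible(true);
            out.writeInt(field.getInt(pojo)); // assumes every field is an int
        }
    }

    // What a generated serializer for Point boils down to: straight-line
    // field accesses with no reflection, which the JIT can compile to
    // efficient machine code.
    static void writeGenerated(Point p, DataOutput out) throws IOException {
        out.writeInt(p.x);
        out.writeInt(p.y);
    }
}
```

The generated variant stands in for the kind of per-type code a code generator can emit once the field layout is known from the framework's type information, while the reflective variant illustrates the per-call lookup cost that such generation avoids.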