Efficient Inter-Kernel Communication for OpenCL Database Operators on FPGAs
Tobias Drewes, J. Joseph, B. Gurumurthy, David Broneske, G. Saake, Thilo Pionteck
2018 International Conference on Field-Programmable Technology (FPT). DOI: 10.1109/FPT.2018.00050
Abstract
Many modern database engines use OpenCL to target heterogeneous hardware. Queries are evaluated by executing chains of low-level operators. The common paradigm for OpenCL workloads relies on buffers in off-chip memory for communication between kernels. On FPGAs this is a severe performance limitation, because their memory systems are weak compared with the memory hierarchies of CPUs and GPUs. To overcome this bottleneck, we propose structural optimizations of the kernel code. On-chip pipelining and code fusion are analyzed as alternatives to buffer-based inter-kernel communication. We assess their impact on resource utilization and system throughput and demonstrate that properly structured code achieves a speedup of more than 4x over the default paradigm. This shows that, for chains of kernels, it is essential to optimize not only the individual kernels but also the communication between them.
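The three communication styles named above differ mainly in where the intermediate result of one operator lives before the next operator consumes it. The following is a minimal sketch under stated assumptions. It is written in Python purely to model that data movement and is not OpenCL or the authors' code. The operator chain (a selection feeding a SUM), the function names, the sample column and the threshold are all hypothetical. In a real FPGA design the pipelined variant would typically use on-chip FIFOs such as OpenCL pipes or a vendor channel extension, and the fused variant would be a single kernel.

```python
from typing import Iterable, Iterator, List

# Hypothetical operator chain: selection (v > threshold) feeding a SUM.
# Data and predicate are illustrative assumptions, not taken from the paper.


def select_buffered(column: List[int], threshold: int) -> List[int]:
    """Buffer paradigm, kernel 1: materializes every qualifying value in an
    intermediate list, standing in for an off-chip global-memory buffer."""
    return [v for v in column if v > threshold]


def sum_buffered(buffer: List[int]) -> int:
    """Buffer paradigm, kernel 2: re-reads the complete intermediate buffer."""
    return sum(buffer)


def select_stream(column: Iterable[int], threshold: int) -> Iterator[int]:
    """Pipelining, producer stage: yields each qualifying value immediately,
    analogous to writing it into an on-chip FIFO (pipe/channel)."""
    for v in column:
        if v > threshold:
            yield v


def sum_stream(stream: Iterable[int]) -> int:
    """Pipelining, consumer stage: reads values one at a time, so the full
    intermediate result is never stored anywhere."""
    total = 0
    for v in stream:
        total += v
    return total


def select_sum_fused(column: Iterable[int], threshold: int) -> int:
    """Code fusion: predicate and aggregation share one loop body, so the
    intermediate value never leaves the loop."""
    total = 0
    for v in column:
        if v > threshold:
            total += v
    return total


column = [3, 8, 1, 9, 4, 7]
threshold = 4
buffered = sum_buffered(select_buffered(column, threshold))
pipelined = sum_stream(select_stream(column, threshold))
fused = select_sum_fused(column, threshold)
```

All three variants compute the same result (8 + 9 + 7 = 24). Only the buffered variant writes the full intermediate result out and reads it back, which is the off-chip round trip the abstract identifies as the bottleneck. The pipelined variant keeps two separate stages connected by a stream, while the fused variant trades that modularity for a single loop.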