Hardware-Conscious Sliding Window Aggregation on GPUs
G. Michas, Periklis Chrysogelos, Ioannis Mytilinis, A. Ailamaki
Proceedings of the 17th International Workshop on Data Management on New Hardware, 2021-06-20. DOI: 10.1145/3465998.3466014 (https://doi.org/10.1145/3465998.3466014)
Abstract: Stream Processing Engines (SPEs) have recently begun utilizing heterogeneous coprocessors (e.g., GPUs) to meet the velocity requirements of modern real-time applications. The massive parallelism and high memory bandwidth of GPUs can significantly increase processing throughput in data-intensive streaming scenarios, such as windowed aggregations. However, previous research focused only on the overall architecture of hybrid CPU-GPU streaming systems, and the need for efficient in-GPU window operators was overshadowed by the limited interconnect bandwidth. With aggregation taking up a significant portion of streaming workloads, in this work we analyze and optimize the performance of sliding window aggregates on GPUs. Current implementations under-utilize the hardware and, for a range of query parameters, cannot even saturate the bandwidth of the interconnect. To optimize execution, we first evaluate the fundamental building blocks of streaming aggregation for GPUs and identify the performance bottlenecks. Then, we build Slider: an adaptive algorithm that selects the most appropriate primitives and kernel configurations based on the query parameters. Our evaluation shows that Slider outperforms previous approaches by 3×-1250× and saturates both the interconnect and the memory bandwidth for a wide range of examined input workloads.

Reducing Bloom Filter CPU Overhead in LSM-Trees on Modern Storage Devices
Zichen Zhu, J. Mun, Aneesh Raman, Manos Athanassoulis
Proceedings of the 17th International Workshop on Data Management on New Hardware, 2021-06-20. DOI: 10.1145/3465998.3466002 (https://doi.org/10.1145/3465998.3466002)
Abstract: Bloom filters (BFs) accelerate point lookups in Log-Structured Merge (LSM) trees by reducing unnecessary storage accesses to levels that do not contain the desired key. BFs are particularly beneficial when there is a significant performance difference between querying a BF (hashing and accessing memory) and accessing data on secondary storage. This gap, however, is decreasing as modern storage devices (SSDs and NVMs) have increasingly lower latency, to the point that the cost of accessing data can be comparable to that of filter probing and hashing, especially for large key sizes that exhibit high hashing cost. In an LSM-tree, BFs are employed when querying each level of the tree, thus exacerbating the CPU cost as the data size (and thus the tree height) grows. To address the increasing CPU cost of BFs in LSM-trees, we propose to re-use hash calculations aggressively within and across BFs, as well as between different levels, and we show both analytically and experimentally that we can maintain a close-to-ideal false positive rate while significantly reducing the runtime. The reduced CPU cost for queries using the proposed hash sharing leads to 10% higher lookup performance in an LSM-tree with 22GB of data (5 levels) stored on a state-of-the-art PCIe SSD. The benefit further increases for faster underlying storage: for faster NVM devices, hash sharing leads to performance gains of up to 40%.

The Case for SIMDified Analytical Query Processing on GPUs
Johannes Fett, A. Ungethüm, Dirk Habich, Wolfgang Lehner
Proceedings of the 17th International Workshop on Data Management on New Hardware, 2021-06-20. DOI: 10.1145/3465998.3466015 (https://doi.org/10.1145/3465998.3466015)
Abstract: Data-level parallelism (DLP) is a heavily used hardware-driven parallelization technique for optimizing analytical query processing, especially in in-memory column stores. This kind of parallelism is characterized by executing essentially the same operation on different data elements simultaneously. Besides Single Instruction Multiple Data (SIMD) extensions on common x86 processors, GPUs also provide DLP, but with a different execution model called Single Instruction Multiple Threads (SIMT), where multiple scalar threads are executed in a SIMD manner. Unfortunately, a complete GPU-specific implementation of all query operators currently has to be built from scratch, since vectorized operator implementations for x86 processors cannot yet be ported to GPUs. To avoid this implementation effort, in this paper we present our vision of virtualizing GPUs as virtual vector engines with software-defined SIMD instructions and of specializing hardware-oblivious vectorized operators to GPUs using our Template Vector Library (TVL).

A cost model for NDP-aware query optimization for KV-stores
Christian Knödler, Tobias Vinçon, Arthur Bernhardt, Ilia Petrov, Leonardo Solis-Vasquez, Lukas Weber, Andreas Koch
Proceedings of the 17th International Workshop on Data Management on New Hardware, 2021-06-20. DOI: 10.1145/3465998.3466013 (https://doi.org/10.1145/3465998.3466013)
Abstract: Many modern DBMS architectures require transferring data from storage in order to process it afterwards. Given the continuously increasing amounts of data, data transfers quickly become a scalability-limiting factor. Near-Data Processing and smart/computational storage emerge as promising trends, allowing for decoupled in-situ operation execution, reduced data transfers, and better bandwidth utilization. However, not every operation is suitable for in-situ execution, and careful placement and optimization are needed. In this paper we present an NDP-aware cost model. It has been implemented in MySQL and evaluated with nKV. We make several observations underscoring the need for optimization.
