{"title":"The Difficult Balance Between Modern Hardware and Conventional CPUs","authors":"Fabio Maschi, G. Alonso","doi":"10.1145/3592980.3595314","DOIUrl":"https://doi.org/10.1145/3592980.3595314","url":null,"abstract":"Research has demonstrated the potential of accelerators in a wide range of use cases. However, there is a growing imbalance between modern hardware and the CPUs that submit the workload. Recent studies of GPUs on real systems have shown that many servers are often needed per accelerator to generate a high enough load so the computing power is leveraged. This fact is often ignored in research, although it often determines the actual feasibility and overall efficiency of a deployment. In this paper, we conduct a detailed study of the possible configurations and overall cost efficiency of deploying an FPGA-based accelerator on a commercial search engine. First, we show that there are many possible configurations balancing the upstream system and the way the accelerator is configured. Of these configurations, not all of them are suitable in practice, even if they provide some of the highest throughput. Second, we analyse the cost of a deployment capable of sustaining the required workload of the commercial search engine. We examine deployments both on-premises and in the cloud with and without FPGAs and with different board models. The results show that, while FPGAs have the potential to significantly improve overall performance, the performance imbalance between their host CPUs and the FPGAs can make the deployments economically unattractive. These findings are intended to inform the development and deployment of accelerators by showing what is needed on the CPU side to make them effective and also to provide important insights into their end-to-end integration within existing systems.","PeriodicalId":400127,"journal":{"name":"Proceedings of the 19th International Workshop on Data Management on New Hardware","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127675144","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Microarchitectural Analysis of Graph BI Queries on RDBMS","authors":"Rathijit Sen, Yuanyuan Tian","doi":"10.1145/3592980.3595321","DOIUrl":"https://doi.org/10.1145/3592980.3595321","url":null,"abstract":"We present results of microarchitectural analysis for LDBC SNB BI queries on a relational database engine. We find underutilization of multicore CPUs, inefficient instruction execution, data access overheads at the on-chip cache hierarchy, data TLB overheads, and overall low (but short-term high) memory bandwidth utilization. Using huge pages increased query performance by up to 65% and workload performance by 23%.","PeriodicalId":400127,"journal":{"name":"Proceedings of the 19th International Workshop on Data Management on New Hardware","volume":"73 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126404773","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"KeRRaS: Sort-Based Database Query Processing on Wide Tables Using FPGAs","authors":"Mehdi Moghaddamfar, Christian Färber, Wolfgang Lehner, Akash Kumar","doi":"10.1145/3592980.3595300","DOIUrl":"https://doi.org/10.1145/3592980.3595300","url":null,"abstract":"Sorting is an important operation in database query processing. Complex pipeline-breaking operators (e.g., aggregation and equi-join) become single-pass algorithms on sorted tables. Therefore, sort-based query processing is a popular method for FPGA-based database system acceleration. However, most accelerators have a limit on the table width or the number of columns they can sort. This limit is often set by the width of the data path or the amount of BRAM present on the FPGA. In this paper we propose KeRRaS, an abstract sorting algorithm that enables existing sort-based query processors to support arbitrarily wide tables while offering scalability, preserving modularity, and having low resource overhead. Moreover, we present an implementation of KeRRaS based on morphing sort-merge, a resource-efficient FPGA-based query accelerator. The implementation behaves similarly to morphing sort-merge on narrow tables, and scales well as the number of key columns increases.","PeriodicalId":400127,"journal":{"name":"Proceedings of the 19th International Workshop on Data Management on New Hardware","volume":"187 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117278666","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Processing-in-Memory for Databases: Query Processing and Data Transfer","authors":"Alexander Baumstark, M. Jibril, K. Sattler","doi":"10.1145/3592980.3595323","DOIUrl":"https://doi.org/10.1145/3592980.3595323","url":null,"abstract":"The Processing-in-Memory (PIM) paradigm promises to accelerate data processing by pushing down computation to memory, reducing the amount of data transfer between memory and CPU, and – in this way – relieving the CPU from processing. Particularly, in in-memory databases memory access becomes a performance bottleneck. Thus, PIM seems to offer an interesting solution for database processing. In this work, we investigate how commercially available PIM technology can be leveraged to accelerate query processing by offloading (parts of) query operators to memory. Furthermore, we show how to address the problem of limited PIM storage capacity by interleaving transfer and computation and present a cost model for the data placement problem.","PeriodicalId":400127,"journal":{"name":"Proceedings of the 19th International Workshop on Data Management on New Hardware","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130993983","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Elastic Use of Far Memory for In-Memory Database Management Systems","authors":"Donghun Lee, Thomas Willhalm, Minseon Ahn, Suprasad Mutalik Desai, Daniel Booss, Navneet Singh, Daniel Ritter, Jungmin Kim, Oliver Rebholz","doi":"10.1145/3592980.3595311","DOIUrl":"https://doi.org/10.1145/3592980.3595311","url":null,"abstract":"The separation and independent scalability of compute and memory is one of the crucial aspects for modern in-memory database systems (IMDBMSs) in the cloud. The new, cache-coherent memory interconnect Compute Express Link (CXL) promises elastic memory capacity through memory pooling. In this work, we adapt the well-known IMDBMS, SAP HANA, for memory pools by features of table data placement and operational heap memory allocation on far memory, and study the impact of the limited bandwidth and higher latency of CXL. Our results show negligible performance degradation for TPC-C. For the analytical workloads of TPC-H, a notable impact on query processing is observed due to the limited bandwidth and long latency of our early CXL implementation. However, our emulation shows it would be acceptably smaller with the improved CXL memory devices.","PeriodicalId":400127,"journal":{"name":"Proceedings of the 19th International Workshop on Data Management on New Hardware","volume":"487 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133272141","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Delilah: eBPF-offload on Computational Storage","authors":"Niclas Hedam, Morten Tychsen Clausen, Philippe Bonnet, Sangjin Lee, Ken Friis Larsen","doi":"10.1145/3592980.3595319","DOIUrl":"https://doi.org/10.1145/3592980.3595319","url":null,"abstract":"The idea of pushing computation to storage devices has been explored for decades, without widespread adoption so far. The definition of Computational Programs namespaces in NVMe (TP 4091) might be a breakthrough. The proposal defines device-specific programs, that are installed statically, and downloadable programs, offloaded from a host at run-time using eBPF. In this paper, we present the design and implementation of Delilah, the first public description of an actual computational storage device supporting eBPF-based code offload. We conduct experiments to evaluate the overhead of eBPF function execution in Delilah, and to explore design options. This study constitutes a baseline for future work.","PeriodicalId":400127,"journal":{"name":"Proceedings of the 19th International Workshop on Data Management on New Hardware","volume":"89 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117179881","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Accelerating Main-Memory Table Scans with Partial Virtual Views","authors":"F. Schuhknecht, Justus Henneberg","doi":"10.1145/3592980.3595315","DOIUrl":"https://doi.org/10.1145/3592980.3595315","url":null,"abstract":"In main-memory column stores, column scans are one of the base operations performed when answering analytical queries. Typically, one or multiple columns must be filtered with respect to the given query predicate, which, by default, involves inspecting all data of the involved columns. To reduce the amount of data to scan, there exist essentially two strategies: (1) Create a coarse-granular index on the column, then use it for early pruning during each scan. While creating such an index is relatively lightweight, unfortunately, accessing the relevant portions of the column through the index causes unpleasant overhead during scanning. (2) Create materialized views that contain semantic portions of the column and filter on these. While this enables fast scans, unfortunately, it requires physical copying and causes significant space overhead. To break this trade-off, in the following, we propose a view-based strategy that avoids any physical copying of column data while providing optimal scan performance. We achieve this by utilizing tools of the virtual memory subsystem provided by the OS: On the lowest level, we materialize all columns within physical main memory. On top of that, we allow the creation of arbitrarily many partial views in virtual memory that map to subsets of the physical columns having certain properties of interest. Creation, maintenance, and usage of these partial virtual views happens fully adaptively as a side-product of scan-based query processing.","PeriodicalId":400127,"journal":{"name":"Proceedings of the 19th International Workshop on Data Management on New Hardware","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115828969","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Zero-sided RDMA: Network-driven Data Shuffling","authors":"Matthias Jasny, Lasse Thostrup, Carsten Binnig","doi":"10.1145/3592980.3595302","DOIUrl":"https://doi.org/10.1145/3592980.3595302","url":null,"abstract":"In this paper, we present a novel communication scheme called zero-sided RDMA, enabling data exchange as a native network service using a programmable switch. In contrast to one- or two-sided RDMA, in zero-sided RDMA, neither the sender nor the receiver is actively involved in data exchange. Zero-sided RDMA thus enables efficient RDMA-based data shuffling between heterogeneous hardware devices in a disaggregated setup. In our initial evaluation, we show that zero-sided RDMA can outperform existing one-sided RDMA-based schemes due to offloading the coordination to the network and new optimizations that are only possible by coordinating the data exchange on the switch.","PeriodicalId":400127,"journal":{"name":"Proceedings of the 19th International Workshop on Data Management on New Hardware","volume":"60 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132375619","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Why Your Experimental Results Might Be Wrong","authors":"F. Schuhknecht, Justus Henneberg","doi":"10.1145/3592980.3595317","DOIUrl":"https://doi.org/10.1145/3592980.3595317","url":null,"abstract":"Research projects in the database community are often evaluated based on experimental results. A typical evaluation setup looks as follows: Multiple methods to compare with each other are embedded in a single shared benchmarking codebase. In this codebase, all methods execute an identical workload to collect the individual execution times. This seems reasonable: Since the only difference between individual test runs are the methods themselves, any observed time difference can be attributed to these methods. Also, such a benchmarking codebase can be used for gradual optimization: If one method runs slowly, its code can be optimized and re-evaluated. If its performance improves, this improvement can be attributed to the particular optimization. Unfortunately, we had to learn the hard way that it is not that simple. The reason for this lies in a component that sits right between our benchmarking codebase and the produced experimental results — the compiler. As we will see in the following case study, this black-box component has the power to completely ruin any meaningful comparison between methods, even if we setup our experiments as equal and fair as possible.","PeriodicalId":400127,"journal":{"name":"Proceedings of the 19th International Workshop on Data Management on New Hardware","volume":"111 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125547491","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards Data-Based Cache Optimization of B+-Trees","authors":"Roland Kühn, Daniel Biebert, Christian Hakert, Jian-Jia Chen, J. Teubner","doi":"10.1145/3592980.3595316","DOIUrl":"https://doi.org/10.1145/3592980.3595316","url":null,"abstract":"The rise of in-memory databases and systems with considerably large memories and cache sizes requires the rethinking of the proper implementation of index structures like B+-trees in such systems. While disk block-sized nodes and binary search were considered as good in the past, smaller node sizes and cache-friendly linear search within nodes can be noticeably more performant nowadays. Considering the probabilistic distribution of lookup values to the B+-tree as part of a memory-friendly and cache-aware layout is a consequent next step, which is studied in this paper. Favoring frequently visited nodes and paths in the regard of cache hits can improve the overall performance of the tree and, thus, of the entire database system. We provide such an optimized B+-tree layout, which takes the probabilistic distribution of the lookup values as a basis. Experimental evaluation shows that choosing rather small node sizes in combination with our optimization algorithm can improve the performance by up to in comparison to a default baseline.","PeriodicalId":400127,"journal":{"name":"Proceedings of the 19th International Workshop on Data Management on New Hardware","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129977574","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}