Proceedings of the 13th International Workshop on Data Management on New Hardware (DaMoN 2017), May 14, 2017

Title: Big data causing big (TLB) problems: taming random memory accesses on the GPU
Authors: Tomas Karnagel, Tal Ben-Nun, Matthias Werner, Dirk Habich, Wolfgang Lehner
DOI: https://doi.org/10.1145/3076113.3076115
Abstract: GPUs are increasingly adopted for large-scale database processing, where data accesses represent the major part of the computation. If the data accesses are irregular, like hash table accesses or random sampling, GPU performance can suffer. Especially when scaling such accesses beyond 2 GB of data, a performance decrease of an order of magnitude is encountered. This paper analyzes the source of the slowdown through extensive micro-benchmarking, attributing the root cause to the Translation Lookaside Buffer (TLB). Using the micro-benchmarks, the TLB hierarchy and structure are fully analyzed on two different GPU architectures, identifying never-before-published TLB sizes that can be used for efficient large-scale application tuning. Based on the gained knowledge, we propose a TLB-conscious approach to mitigate the slowdown for algorithms with irregular memory access. The proposed approach is applied to two fundamental database operations, random sampling and hash-based grouping, showing that the slowdown can be dramatically reduced, resulting in a performance increase of up to 13×.
{"title":"Profiling a GPU database implementation: a holistic view of GPU resource utilization on TPC-H queries","authors":"Emily Furst, M. Oskin, Bill Howe","doi":"10.1145/3076113.3076119","DOIUrl":"https://doi.org/10.1145/3076113.3076119","url":null,"abstract":"General Purpose computing on Graphics Processing Units (GPGPU) has become an increasingly popular option for accelerating database queries. However, GPUs are not well-suited for all types of queries as data transfer costs can often dominate query execution. We develop a methodology for quantifying how well databases utilize GPU architectures using proprietary profiling tools. By aggregating various profiling metrics, we break down the different aspects that comprise occupancy on the GPU across the runtime of query execution. We show that for the Alenka GPU database, only a small minority of execution time, roughly 5% is spent on the GPU. We further show that even on queries with seemingly good performance, a large portion of the achieved occupancy can actually be attributed to stalls and scalar instructions.","PeriodicalId":185720,"journal":{"name":"Proceedings of the 13th International Workshop on Data Management on New Hardware","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117166151","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scaling column imprints using advanced vectorization","authors":"Lefteris Sidirourgos, H. Mühleisen","doi":"10.1145/3076113.3076120","DOIUrl":"https://doi.org/10.1145/3076113.3076120","url":null,"abstract":"Column Imprints is a pre-filtering secondary index for answering range queries. The main feature of imprints is that they are light-weight and are based on compressed bit-vectors, one per cacheline, that quickly determine if the values in that cacheline satisfy the predicates of a query. The main overhead of the imprints implementation is the many sequential value comparisons against the boundaries of a virtual equi-height histogram. Similarly, during query scans, many sequential value comparisons are performed to identify false positives. In this paper, we speed-up the process of imprints creation and querying by using advanced vectorization techniques. We also experimentally explore the benefits of stretching imprints to larger bit-vector sizes and blocks of data, using 256-bit SIMD registers. Our findings are very promising for both imprints and for future index design research that would employ advanced vectorization techniques and larger (up to 512-bit) and more (from 16 now to 32) SIMD registers.","PeriodicalId":185720,"journal":{"name":"Proceedings of the 13th International Workshop on Data Management on New Hardware","volume":"153 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128209957","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An analysis of memory power consumption in database systems","authors":"A.Ye. Karyakin, K. Salem","doi":"10.1145/3076113.3076117","DOIUrl":"https://doi.org/10.1145/3076113.3076117","url":null,"abstract":"The growing appetite for in-memory computing is increasing memory's share of total server power consumption. However, memory power consumption in database management systems is not well understood. This paper presents an empirical characterization of memory power consumption in database systems, for both analytical and transactional workloads. Our results indicate that memory power optimization will be effective only if it can reduce back-ground power through more aggressive use of low power memory idle states.","PeriodicalId":185720,"journal":{"name":"Proceedings of the 13th International Workshop on Data Management on New Hardware","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129246436","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A methodology for OLTP micro-architectural analysis","authors":"Utku Sirin, Ahmad Yasin, A. Ailamaki","doi":"10.1145/3076113.3076116","DOIUrl":"https://doi.org/10.1145/3076113.3076116","url":null,"abstract":"Micro-architectural analysis is critical to investigate the interaction between workloads and processors. While today's aggressive out-of-order processors provide a rich set of performance events for deep execution cycle analysis, OLTP characterization studies usually use a cache-miss-based method (CMBM). In this work, we investigate the validity and the functionality of CMBM by comparing it with Intel's state-of-the-art Top-down Micro-architecture Analysis Method (TMAM) for OLTP workloads. We show that, while CMBM and TMAM provide a similar high-level micro-architectural behavior, it is inadequate for a fine-grained micro-architectural analysis. We further show that TMAM underestimates memory stalls. We optimize TMAM's execution cycle breakdown, and improve its estimation of memory stalls up to 50%.","PeriodicalId":185720,"journal":{"name":"Proceedings of the 13th International Workshop on Data Management on New Hardware","volume":"71 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114781992","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Deadlock-free joins in DB-mesh, an asynchronous systolic array accelerator","authors":"Bingyi Cao, K. A. Ross, S. Edwards, Martha A. Kim","doi":"10.1145/3076113.3076118","DOIUrl":"https://doi.org/10.1145/3076113.3076118","url":null,"abstract":"Previous database accelerator proposals such as the Q100 provide a fixed set of database operators, chosen to support a target query workload. Some queries may not be well-supported by a fixed accelerator, typically because they need more resources/operators of a particular kind than the accelerator provides. By Amdahl's law, these queries become relatively more expensive as they are not fully accelerated. We propose a second-level accelerator, DB-Mesh, to take up some of this workload. DB-Mesh is an asynchronous systolic array that is more generic than the Q100, and can be configured to run a variety of operators with configurable parameters such as record widths. We demonstrate DB-Mesh applied to nested loops joins, an operator that is not directly supported on the Q100. We show that a naïve implementation has the potential for deadlock, and show how to avoid deadlock with a careful design. We also demonstrate how the data flow policy used in the array influences system throughput.","PeriodicalId":185720,"journal":{"name":"Proceedings of the 13th International Workshop on Data Management on New Hardware","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126479938","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: SiliconDB: rethinking DBMSs for modern heterogeneous co-processor environments
Authors: Kayhan Dursun, Carsten Binnig, U. Çetintemel, R. Petrocelli
DOI: https://doi.org/10.1145/3076113.3076124
Abstract: In the last decade, the work centered around specialized co-processors for DBMSs has largely focused on efficient query processing algorithms for individual operators. However, a major limitation of existing co-processor systems is the PCI bottleneck, which severely limits the efficient use of this type of hardware in current systems. In recent years, we have seen the emergence of a new class of co-processor systems that include specialized accelerators, implemented as ASICs or FPGAs, which co-reside with the CPU on the same socket. Here we revisit DBMS architectures in this context, and take an initial step towards the design of a new database system called SiliconDB that targets these new densely integrated heterogeneous co-processor environments.
Title: An analysis of LSM caching in NVRAM
Authors: L. Lersch, Ismail Oukid, Wolfgang Lehner, I. Schreter
DOI: https://doi.org/10.1145/3076113.3076123
Abstract: The rise of NVRAM technologies promises to change the way we think about system architectures. In order to fully exploit its advantages, systems must be specially tailored for NVRAM devices. Not only does this impose great challenges, but developing full system architectures from scratch is undesirable in many scenarios due to prohibitive development costs. Instead, we analyze in this paper the behavior of an existing log-structured persistent key-value store, namely LevelDB, when run on top of an emulated NVRAM device. We investigate initial opportunities for improvement when adapting a system tailored for HDDs/SSDs to run in an NVRAM environment. Furthermore, we analyze the behavior of the legacy DRAM caching component of LevelDB and whether more suitable caching policies are required.
{"title":"Faster across the PCIe bus: a GPU library for lightweight decompression: including support for patched compression schemes","authors":"Eyal Rozenberg, P. Boncz","doi":"10.1145/3076113.3076122","DOIUrl":"https://doi.org/10.1145/3076113.3076122","url":null,"abstract":"This short paper present a collection of GPU lightweight decompression algorithms implementations within a FOSS library, Giddy - the first to be published to offer such functionality. As the use of compression is important in ameliorating PCIe data transfer bottlenecks, we believe this library and its constituent implementations can serve as useful building blocks in GPU-accelerated DBMSes --- as well as other data-intensive systems. The paper also includes an initial exploration of GPU-oriented patched compression schemes. Patching makes compression ratio robust against outliers, and is important with real-life data, which (in contrast to many synthetic benchmark datasets) exhibits non-uniform data distributions and noise. An experimental evaluation of both the unpatched and the patched schemes in Giddy is included.","PeriodicalId":185720,"journal":{"name":"Proceedings of the 13th International Workshop on Data Management on New Hardware","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133712857","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A PetriNet mechanism for OLAP in NUMA","authors":"Simone Dominico, E. Almeida, J. Meira","doi":"10.1145/3076113.3076121","DOIUrl":"https://doi.org/10.1145/3076113.3076121","url":null,"abstract":"In the parallel execution of queries in Non-Uniform Memory Access (NUMA), the operating system maps database processes/threads (i.e., workers) to the available cores across the NUMA nodes. However, this mapping results in poor cache activity with many minor page faults and slower query response time when workers and data are allocated in different NUMA nodes. The system needs to move large volumes of data around the NUMA nodes to catch up with the running workers. Our hypothesis is that we mitigate the data movement to boost cache hits and response time if we only hand out to the system the local optimum number of cores instead of all the available ones. In this paper we present a PetriNet mechanism that represents the load of the database workers for dynamically computing and allocating the local optimum number of CPU cores to tackle such load. Preliminary results show that data movement diminishes with the local optimum number of CPU cores.","PeriodicalId":185720,"journal":{"name":"Proceedings of the 13th International Workshop on Data Management on New Hardware","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122258330","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}