Accelerating Columnar Storage Based on Asynchronous Skipping Strategy
Wenhai Li, Zheng Yang, Lingfeng Deng, Zhiling Cheng, Weidong Wen, Yanxiang He
Big Data Research, Volume 31, Article 100352, published 2023-02-28. DOI: 10.1016/j.bdr.2022.100352
https://www.sciencedirect.com/science/article/pii/S2214579622000466
Citations: 0
Abstract
Many database applications, such as OnLine Analytical Processing (OLAP), web-based information extraction, or scientific computation, need to select a subset of fields based on several user-defined filters. Developers of these applications require effective assembly methods for on-demand filtering and aggregation, which raises new challenges in deploying parallel computing components on top of columnar storage.
To efficiently generate qualified records, an asynchronous skipping strategy is presented to speed up filtering and decoding in column-based storage. Concentrating on filter pushdown in parallel analytical workloads, we offer an in-depth analysis of record assembly. We highlight the bottleneck of traditional record-wise assembly methods when evaluating analytical tasks on a nested schema. Built on a concurrent queue structure, the asynchronous skipping strategy evaluates each column scan separately in a software pipeline whose stages may run on different sets of threads. We show how to read the sequential blocks of each column intensively, and how to effectively eliminate invalid payloads by integrating filter pushdown into an asynchronous I/O stack.
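The pipeline described above can be illustrated with a minimal Python sketch: a scan thread applies the pushed-down filter per column block and hands only qualifying blocks to a decode thread through a concurrent queue, so blocks that fail the filter are skipped before any decoding work. The function names, block layout, and predicate here are illustrative assumptions, not the paper's actual implementation.

```python
import queue
import threading

SENTINEL = None  # marks the end of the column scan

def scan_column(blocks, predicate, q):
    """Producer: check the pushed-down filter per block; skip misses."""
    for block in blocks:
        if any(predicate(v) for v in block):  # block-level filter check
            q.put(block)                      # only qualifying blocks travel
    q.put(SENTINEL)

def decode_column(q, predicate, out):
    """Consumer: 'decode' (here, just filter) values from qualifying blocks."""
    while True:
        block = q.get()
        if block is SENTINEL:
            break
        out.extend(v for v in block if predicate(v))

# Toy column of three blocks; the middle block fails the filter entirely
# and is therefore never handed to the decode stage.
blocks = [[1, 2, 3], [10, 11, 12], [4, 5, 6]]
pred = lambda v: v < 7
q = queue.Queue(maxsize=4)   # bounded queue decouples scan from decode
result = []

consumer = threading.Thread(target=decode_column, args=(q, pred, result))
consumer.start()
scan_column(blocks, pred, q)  # scan runs concurrently with decoding
consumer.join()
print(result)  # only values from blocks that survived the pushed-down filter
```

A bounded queue gives the decoupling the abstract describes: the scan stage can run ahead on sequential block reads while the decode stage consumes at its own pace, and skipped blocks never occupy queue slots or decoding time.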
We implement a columnar store supporting filter pushdown on nested schemas. Our experiments are conducted on a de facto standard benchmark using both variant-selectivity scans and ad-hoc queries. The results show that in parallel I/O-intensive workloads, our implementation improves the I/O performance of the state of the art by 1.3x to 2.7x. Coupling the asynchronous strategy with filter pushdown, our implementation markedly outperforms its competitors on heavyweight decoding workloads on both HDD and SSD.
About the journal:
The journal aims to promote and communicate advances in big data research by providing a fast, high-quality forum for researchers, practitioners, and policy makers from the many different communities working on, and with, this topic.
The journal will accept papers on foundational aspects of dealing with big data, as well as papers on specific platforms and technologies used to deal with big data. To promote data science and interdisciplinary collaboration between fields, and to showcase the benefits of data-driven research, papers demonstrating applications of big data in domains as diverse as geoscience, the social web, finance, e-commerce, health care, environment and climate, physics and astronomy, chemistry, life sciences and drug discovery, digital libraries and scientific publications, and security and government will also be considered. Occasionally the journal may publish white papers on policies, standards, and best practices.