A Method for Optimizing Opaque Filter Queries

Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data. Pub Date: 2020-05-29. DOI: 10.1145/3318464.3389766

Wenjia He, Michael R. Anderson, M. Strome, Michael J. Cafarella


Citations: 8

Abstract

An important class of database queries in machine learning and data science workloads is the opaque filter query: a query whose selection predicate is implemented as a user-defined function (UDF) with semantics unknown to the query optimizer. Typical examples include a trained CNN-style image classifier or a textual sentiment classifier. Because the optimizer does not know the predicate's semantics, it cannot employ standard optimizations, yielding long query times. We propose voodoo indexing, a two-phase method for optimizing opaque filter queries. Before any query arrives, the method builds a hierarchical, query-independent index of the database contents that groups similar objects together. At query time, the method builds a map of the degree to which each group satisfies the predicate, while also exploiting that map to accelerate execution. Unlike past methods, voodoo indexing does not require insight into predicate semantics, works on any data type, and does not require in-query model training. We describe both standalone and SparkSQL-specific implementations, plus experiments on image and text data with more than 100 distinct opaque predicates. We show that voodoo indexing can yield up to an 88% improvement over standard scan behavior and a 79% improvement over the previous best method adapted from the research literature.
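The two-phase idea can be illustrated with a minimal Python sketch. This is not the paper's algorithm: the abstract does not give the index construction or the query-time policy, so everything here is an assumption made for illustration. A flat grouping by a hypothetical `key_fn` stands in for the hierarchical index, a greedy rule over smoothed per-group hit rates stands in for the map of how well each group satisfies the predicate, and cost is counted as the number of UDF calls needed to return the first `k` matches. The names `build_index`, `query_first_k`, and `scan_first_k` are hypothetical.

```python
from collections import defaultdict


def build_index(items, key_fn):
    """Offline phase (assumed): group similar objects before any query arrives.

    A flat grouping by key_fn stands in for a hierarchical index.
    """
    groups = defaultdict(list)
    for item in items:
        groups[key_fn(item)].append(item)
    return dict(groups)


def query_first_k(groups, predicate, k):
    """Query phase (assumed): return k matches and the number of UDF calls.

    Keeps a per-group map of hits and tries, and always probes the group
    with the highest smoothed hit rate. Unseen groups score 1.0, so every
    group is tried at least once before a poor group is revisited.
    """
    remaining = {g: list(members) for g, members in groups.items()}
    hits = defaultdict(int)
    tries = defaultdict(int)
    results, calls = [], 0
    while len(results) < k:
        candidates = [g for g, members in remaining.items() if members]
        if not candidates:
            break  # database exhausted before k matches were found
        best = max(candidates, key=lambda g: (hits[g] + 1) / (tries[g] + 1))
        item = remaining[best].pop()
        calls += 1  # one opaque UDF evaluation
        tries[best] += 1
        if predicate(item):
            hits[best] += 1
            results.append(item)
    return results, calls


def scan_first_k(items, predicate, k):
    """Baseline: evaluate the UDF on items in storage order."""
    results, calls = [], 0
    for item in items:
        if len(results) >= k:
            break
        calls += 1
        if predicate(item):
            results.append(item)
    return results, calls


if __name__ == "__main__":
    # Toy data: integers grouped by hundreds; the "opaque" predicate
    # happens to match only one group, which the optimizer cannot know.
    data = list(range(1000))
    index = build_index(data, key_fn=lambda x: x // 100)
    opaque = lambda x: 700 <= x < 800
    _, indexed_calls = query_first_k(index, opaque, k=20)
    _, scan_calls = scan_first_k(data, opaque, k=20)
    print("UDF calls with grouping:", indexed_calls)
    print("UDF calls with plain scan:", scan_calls)
```

In this toy setting the grouped search spends a few probes learning which groups are unproductive and then concentrates on the productive one, whereas the scan must evaluate the UDF on every item that precedes the matches. Real gains depend on how well the offline grouping correlates with the predicate, which this sketch does not model.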



