Proceedings of the 2018 International Conference on Management of Data最新文献

筛选
英文 中文
Crowdsourcing Analytics With CrowdCur 使用CrowdCur进行众包分析
Proceedings of the 2018 International Conference on Management of Data Pub Date : 2018-05-27 DOI: 10.1145/3183713.3193563
M. Esfandiari, Kavan Patel, S. Amer-Yahia, Senjuti Basu Roy
{"title":"Crowdsourcing Analytics With CrowdCur","authors":"M. Esfandiari, Kavan Patel, S. Amer-Yahia, Senjuti Basu Roy","doi":"10.1145/3183713.3193563","DOIUrl":"https://doi.org/10.1145/3183713.3193563","url":null,"abstract":"We propose to demonstrate CrowdCur xspace, a system that allows platform administrators, requesters, and workers to conduct various analytics of interest. CrowdCur xspace includes a worker curation component that relies on explicit feedback elicitation to best capture workers' preferences, a task curation component that monitors task completion and aggregates their statistics, and an OLAP-style component to query and combine analytics by a worker, by task type, etc. Administrators can fine tune their system's performance. Requesters can compare platforms and better choose the set of workers to target. Workers can compare themselves to others and find tasks and requesters that suit them best.","PeriodicalId":20430,"journal":{"name":"Proceedings of the 2018 International Conference on Management of Data","volume":"2 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86083749","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 2
On the Calculation of Optimality Ranges for Relational Query Execution Plans 关系型查询执行计划的最优范围计算
Proceedings of the 2018 International Conference on Management of Data Pub Date : 2018-05-27 DOI: 10.1145/3183713.3183742
Florian Wolf, Norman May, P. Willems, K. Sattler
{"title":"On the Calculation of Optimality Ranges for Relational Query Execution Plans","authors":"Florian Wolf, Norman May, P. Willems, K. Sattler","doi":"10.1145/3183713.3183742","DOIUrl":"https://doi.org/10.1145/3183713.3183742","url":null,"abstract":"Cardinality estimation is a crucial task in query optimization and typically relies on heuristics and basic statistical approximations. At execution time, estimation errors might result in situations where intermediate result sizes may differ from the estimated ones, so that the originally chosen plan is not the optimal plan anymore. In this paper we analyze the deviation from the estimate, and denote the cardinality range of an intermediate result, where the optimal plan remains optimal as the optimality range. While previous work used simple heuristics to calculate similar ranges, we generate the precise bounds for the optimality range considering all relevant plan alternatives. Our experimental results show that the fixed optimality ranges used in previous work fail to characterize the range of cardinalities where a plan is optimal. We derive theoretical worst case bounds for the number of enumerated plans required to compute the precise optimality range, and experimentally show that in real queries this number is significantly smaller. Our experiments also show the benefit for applications like Mid-Query Re-Optimization in terms of significant execution time improvement.","PeriodicalId":20430,"journal":{"name":"Proceedings of the 2018 International Conference on Management of Data","volume":"40 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79765130","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 8
Tighter Upper Bounds for Join Cardinality Estimates 连接基数估计的更严格上界
Proceedings of the 2018 International Conference on Management of Data Pub Date : 2018-05-27 DOI: 10.1145/3183713.3183714
Walter Cai
{"title":"Tighter Upper Bounds for Join Cardinality Estimates","authors":"Walter Cai","doi":"10.1145/3183713.3183714","DOIUrl":"https://doi.org/10.1145/3183713.3183714","url":null,"abstract":"1 PROBLEM AND MOTIVATION Despite decades of research, modern database systems still struggle with multijoin queries. Users will often experience long wait times occurring with unpredictable frequency detracting from the usability of the system. In this work we develop a new method to tighten join cardinality upper bounds. The intention for these bounds is to assist the query optimizer (QO) in avoiding expensive physical join plans. Our approach is as follows: leveraging data sketching, and randomized hashing we generate and tighten theoretical join cardinality upper bounds. We outline our base data structures and methodology, and how these bounds may be introduced to a traditional QO framework as a new statistic for physical join plan selection. We evaluate the tightness of our bounds on GooglePlus community graphs and successfully generate degree of magnitude upper bounds even in the presence of multiway cyclic joins.","PeriodicalId":20430,"journal":{"name":"Proceedings of the 2018 International Conference on Management of Data","volume":"10 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75281848","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
SuRF: Practical Range Query Filtering with Fast Succinct Tries SuRF:实用范围查询过滤与快速简洁的尝试
Proceedings of the 2018 International Conference on Management of Data Pub Date : 2018-05-27 DOI: 10.1145/3183713.3196931
Huanchen Zhang, Hyeontaek Lim, Viktor Leis, D. Andersen, M. Kaminsky, Kimberly Keeton, Andrew Pavlo
{"title":"SuRF: Practical Range Query Filtering with Fast Succinct Tries","authors":"Huanchen Zhang, Hyeontaek Lim, Viktor Leis, D. Andersen, M. Kaminsky, Kimberly Keeton, Andrew Pavlo","doi":"10.1145/3183713.3196931","DOIUrl":"https://doi.org/10.1145/3183713.3196931","url":null,"abstract":"We present the Succinct Range Filter (SuRF), a fast and compact data structure for approximate membership tests. Unlike traditional Bloom filters, SuRF supports both single-key lookups and common range queries: open-range queries, closed-range queries, and range counts. SuRF is based on a new data structure called the Fast Succinct Trie (FST) that matches the point and range query performance of state-of-the-art order-preserving indexes, while consuming only 10 bits per trie node. The false positive rates in SuRF for both point and range queries are tunable to satisfy different application needs. We evaluate SuRF in RocksDB as a replacement for its Bloom filters to reduce I/O by filtering requests before they access on-disk data structures. Our experiments on a 100 GB dataset show that replacing RocksDB's Bloom filters with SuRFs speeds up open-seek (without upper-bound) and closed-seek (with upper-bound) queries by up to 1.5× and 5× with a modest cost on the worst-case (all-missing) point query throughput due to slightly higher false positive rate.","PeriodicalId":20430,"journal":{"name":"Proceedings of the 2018 International Conference on Management of Data","volume":"9 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78478051","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 117
Cold Filter: A Meta-Framework for Faster and More Accurate Stream Processing 冷过滤器:一个元框架,更快,更准确的流处理
Proceedings of the 2018 International Conference on Management of Data Pub Date : 2018-05-27 DOI: 10.1145/3183713.3183726
Yang Zhou, Tong Yang, Jie Jiang, B. Cui, Minlan Yu, Xiaoming Li, S. Uhlig
{"title":"Cold Filter: A Meta-Framework for Faster and More Accurate Stream Processing","authors":"Yang Zhou, Tong Yang, Jie Jiang, B. Cui, Minlan Yu, Xiaoming Li, S. Uhlig","doi":"10.1145/3183713.3183726","DOIUrl":"https://doi.org/10.1145/3183713.3183726","url":null,"abstract":"Approximate stream processing algorithms, such as Count-Min sketch, Space-Saving, etc., support numerous applications in databases, storage systems, networking, and other domains. However, the unbalanced distribution in real data streams poses great challenges to existing algorithms. To enhance these algorithms, we propose a meta-framework, called Cold Filter (CF), that enables faster and more accurate stream processing. Different from existing filters that mainly focus on hot items, our filter captures cold items in the first stage, and hot items in the second stage. Also, existing filters require two-direction communication - with frequent exchanges between the two stages; our filter on the other hand is one-direction - each item enters one stage at most once. Our filter can accurately estimate both cold and hot items, giving it a genericity that makes it applicable to many stream processing tasks. To illustrate the benefits of our filter, we deploy it on three typical stream processing tasks and experimental results show speed improvements of up to 4.7 times, and accuracy improvements of up to 51 times. All source code is made publicly available at Github.","PeriodicalId":20430,"journal":{"name":"Proceedings of the 2018 International Conference on Management of Data","volume":"5 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78561034","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 99
Accelerating Machine Learning Inference with Probabilistic Predicates 用概率谓词加速机器学习推理
Proceedings of the 2018 International Conference on Management of Data Pub Date : 2018-05-27 DOI: 10.1145/3183713.3183751
Yao Lu, Aakanksha Chowdhery, Srikanth Kandula, S. Chaudhuri
{"title":"Accelerating Machine Learning Inference with Probabilistic Predicates","authors":"Yao Lu, Aakanksha Chowdhery, Srikanth Kandula, S. Chaudhuri","doi":"10.1145/3183713.3183751","DOIUrl":"https://doi.org/10.1145/3183713.3183751","url":null,"abstract":"Classic query optimization techniques, including predicate pushdown, are of limited use for machine learning inference queries, because the user-defined functions (UDFs) which extract relational columns from unstructured inputs are often very expensive; query predicates will remain stuck behind these UDFs if they happen to require relational columns that are generated by the UDFs. In this work, we demonstrate constructing and applying probabilistic predicates to filter data blobs that do not satisfy the query predicate; such filtering is parametrized to different target accuracies. Furthermore, to support complex predicates and to avoid per-query training, we augment a cost-based query optimizer to choose plans with appropriate combinations of simpler probabilistic predicates. Experiments with several machine learning workloads on a big-data cluster show that query processing improves by as much as 10x.","PeriodicalId":20430,"journal":{"name":"Proceedings of the 2018 International Conference on Management of Data","volume":"3 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75577452","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 85
Fine-grained Concept Linking using Neural Networks in Healthcare 在医疗保健中使用神经网络的细粒度概念链接
Proceedings of the 2018 International Conference on Management of Data Pub Date : 2018-05-27 DOI: 10.1145/3183713.3196907
Jian Dai, Meihui Zhang, Gang Chen, Ju Fan, K. Ngiam, B. Ooi
{"title":"Fine-grained Concept Linking using Neural Networks in Healthcare","authors":"Jian Dai, Meihui Zhang, Gang Chen, Ju Fan, K. Ngiam, B. Ooi","doi":"10.1145/3183713.3196907","DOIUrl":"https://doi.org/10.1145/3183713.3196907","url":null,"abstract":"To unlock the wealth of the healthcare data, we often need to link the real-world text snippets to the referred medical concepts described by the canonical descriptions. However, existing healthcare concept linking methods, such as dictionary-based and simple machine learning methods, are not effective due to the word discrepancy between the text snippet and the canonical concept description, and the overlapping concept meaning among the fine-grained concepts. To address these challenges, we propose a Neural Concept Linking (NCL) approach for accurate concept linking using systematically integrated neural networks. We call the novel neural network architecture as the COMposite AttentIonal encode-Decode neural network (COM-AID). COM-AID performs an encode-decode process that encodes a concept into a vector and decodes the vector into a text snippet with the help of two devised contexts. On the one hand, it injects the textual context into the neural network through the attention mechanism, so that the word discrepancy can be overcome from the semantic perspective. On the other hand, it incorporates the structural context into the neural network through the attention mechanism, so that minor concept meaning differences can be enlarged and effectively differentiated. Empirical studies on two real-world datasets confirm that the NCL produces accurate concept linking results and significantly outperforms state-of-the-art techniques.","PeriodicalId":20430,"journal":{"name":"Proceedings of the 2018 International Conference on Management of Data","volume":"3 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80129732","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 16
ZigZag: Supporting Similarity Queries on Vector Space Models ZigZag:支持向量空间模型上的相似性查询
Proceedings of the 2018 International Conference on Management of Data Pub Date : 2018-05-27 DOI: 10.1145/3183713.3196936
Wenhai Li, Lingfeng Deng, Yang Li, Chen Li
{"title":"ZigZag: Supporting Similarity Queries on Vector Space Models","authors":"Wenhai Li, Lingfeng Deng, Yang Li, Chen Li","doi":"10.1145/3183713.3196936","DOIUrl":"https://doi.org/10.1145/3183713.3196936","url":null,"abstract":"In this paper we study the problem of supporting similarity queries on a large number of records using a vector space model, where each record is a bag of tokens. We consider similarity functions that incorporate non-negative global token weights as well as record-specific token degrees. We develop a family of algorithms based on an inverted index for large data sets, especially for the case of using external storage such as hard disks or flash drives, and present pruning techniques based on various bounds to improve their performance. We formally prove the correctness of these techniques, and show how to achieve better pruning power by iteratively tightening these bounds to exactly filter dissimilar records. We conduct an extensive experimental study using real, large-scale data sets based on different storage platforms, including memory, hard disks, and flash drives. The results show that these algorithms and techniques can efficiently support similarity queries on large data sets.","PeriodicalId":20430,"journal":{"name":"Proceedings of the 2018 International Conference on Management of Data","volume":"9 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81770235","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 3
Machine Learning for Data Management: Problems and Solutions 数据管理中的机器学习:问题与解决方案
Proceedings of the 2018 International Conference on Management of Data Pub Date : 2018-05-27 DOI: 10.1145/3183713.3199515
Pedro M. Domingos
{"title":"Machine Learning for Data Management: Problems and Solutions","authors":"Pedro M. Domingos","doi":"10.1145/3183713.3199515","DOIUrl":"https://doi.org/10.1145/3183713.3199515","url":null,"abstract":"Machine learning has made great strides in recent years, and its applications are spreading rapidly. Unfortunately, the standard machine learning formulation does not match well with data management problems. For example, most learning algorithms assume that the data is contained in a single table, and consists of i.i.d. (independent and identically distributed) samples. This leads to a proliferation of ad hoc solutions, slow development, and suboptimal results. Fortunately, a body of machine learning theory and practice is being developed that dispenses with such assumptions, and promises to make machine learning for data management much easier and more effective [1]. In particular, representations like Markov logic, which includes many types of deep networks as special cases, allow us to define very rich probability distributions over non-i.i.d., multi-relational data [2]. Despite their generality, learning the parameters of these models is still a convex optimization problem, allowing for efficient solution. Learning structure-in the case of Markov logic, a set of formulas in first-order logic-is intractable, as in more traditional representations, but can be done effectively using inductive logic programming techniques. Inference is performed using probabilistic generalizations of theorem proving, and takes linear time and space in tractable Markov logic, an object-oriented specialization of Markov logic [3]. These techniques have led to state-of-the-art, principled solutions to problems like entity resolution, schema matching, ontology alignment, and information extraction. Using tractable Markov logic, we have extracted from the Web a probabilistic knowledge base with millions of objects and billions of parameters, which can be queried exactly in subsecond times using an RDBMS backend [3]. With these foundations in place, we expect the pace of machine learning applications in data management to continue to accelerate in coming years.","PeriodicalId":20430,"journal":{"name":"Proceedings of the 2018 International Conference on Management of Data","volume":"52 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84682798","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 9
Auto-Detect: Data-Driven Error Detection in Tables 自动检测:表中数据驱动的错误检测
Proceedings of the 2018 International Conference on Management of Data Pub Date : 2018-05-27 DOI: 10.1145/3183713.3196889
Zhipeng Huang, Yeye He
{"title":"Auto-Detect: Data-Driven Error Detection in Tables","authors":"Zhipeng Huang, Yeye He","doi":"10.1145/3183713.3196889","DOIUrl":"https://doi.org/10.1145/3183713.3196889","url":null,"abstract":"Given a single column of values, existing approaches typically employ regex-like rules to detect errors by finding anomalous values inconsistent with others. Such techniques make local decisions based only on values in the given input column, without considering a more global notion of compatibility that can be inferred from large corpora of clean tables. We propose sj, a statistics-based technique that leverages co-occurrence statistics from large corpora for error detection, which is a significant departure from existing rule-based methods. Our approach can automatically detect incompatible values, by leveraging an ensemble of judiciously selected generalization languages, each of which uses different generalizations and is sensitive to different types of errors. Errors so detected are based on global statistics, which is robust and aligns well with human intuition of errors. We test sj on a large set of public Wikipedia tables, as well as proprietary enterprise Excel files. While both of these test sets are supposed to be of high-quality, sj makes surprising discoveries of over tens of thousands of errors in both cases, which are manually verified to be of high precision (over 0.98). Our labeled benchmark set on Wikipedia tables is released for future research.","PeriodicalId":20430,"journal":{"name":"Proceedings of the 2018 International Conference on Management of Data","volume":"47 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2018-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85874798","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 29
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
相关产品
×
本文献相关产品
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信