Proceedings of the 2018 International Conference on Management of Data: Latest Articles

Top-k Sorting Under Partial Order Information

Proceedings of the 2018 International Conference on Management of Data Pub Date : 2018-05-27 DOI: 10.1145/3183713.3199672

Eyal Dushkin, T. Milo

Abstract: We address the problem of sorting the top-k elements of a set, given a predefined partial order over the set elements. Our means to obtain missing order information is via a comparison operator that interacts with a crowd of domain experts to determine the order between two unordered items. The practical motivation for studying this problem is the common scenario where elements cannot be easily compared by machines and thus human experts are harnessed for this task. As some initial partial order is given, our goal is to optimally exploit it in order to minimize the domain experts' work. The problem lies at the intersection of two well-studied problems in the theory and crowdsourcing communities: full sorting under partial order information and top-k sorting with no prior partial order information. As we show, resorting to one of the existing state-of-the-art algorithms in these two problems turns out to be extravagant in terms of the number of comparisons performed by the users. In light of this, we present a dedicated algorithm for top-k sorting that aims to minimize the number of comparisons by thoroughly leveraging the partial order information. We examine two possible interpretations of the comparison operator, taken from the theory and crowdsourcing communities, and demonstrate the efficiency and effectiveness of our algorithm for both of them. We further demonstrate the utility of our algorithm, beyond identifying the top-k elements in a dataset, as a vehicle to improve the performance of Learning-to-Rank algorithms in a machine learning context. We conduct a comprehensive experimental evaluation in both synthetic and real-world settings.
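To make the problem setting in this abstract concrete, here is a minimal sketch of top-k selection that consults a comparison oracle only for pairs the known partial order does not already decide. It is not the authors' algorithm: it is a naive repeated-maximum scan with a brute-force transitive closure, and the names `top_k_with_partial_order`, `known`, and `ask` are hypothetical. It also assumes the oracle is consistent with a single total order that extends the given partial order.

```python
from itertools import product


def top_k_with_partial_order(items, known, ask, k):
    """Return (top-k items in descending order, number of oracle questions).

    known: set of (a, b) pairs meaning "a is already known to rank above b".
    ask(a, b): hypothetical oracle, True if a ranks above b.
    """
    above = set(known)

    def close():
        # Naive transitive closure; adequate only for a small illustration.
        changed = True
        while changed:
            changed = False
            for (a, b), (c, d) in product(list(above), repeat=2):
                if b == c and (a, d) not in above:
                    above.add((a, d))
                    changed = True

    close()
    remaining = list(items)
    result = []
    questions = 0
    while remaining and len(result) < k:
        best = remaining[0]
        for x in remaining[1:]:
            if (best, x) in above:
                continue            # order already known, no question needed
            if (x, best) in above:
                best = x            # order already known, no question needed
                continue
            questions += 1          # only undecided pairs reach the oracle
            if ask(x, best):
                above.add((x, best))
                best = x
            else:
                above.add((best, x))
            close()                 # reuse each answer through transitivity
        result.append(best)
        remaining.remove(best)
    return result, questions
```

Every answer is added to the known relation and closed under transitivity, so later rounds can often pick the next maximum without asking anything new. A practical implementation would replace the brute-force closure with an incremental reachability structure.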

Citations: 9

Efficient Top-K Query Processing on Massively Parallel Hardware

Proceedings of the 2018 International Conference on Management of Data Pub Date : 2018-05-27 DOI: 10.1145/3183713.3183735

Anil Shanbhag, H. Pirk, S. Madden

Citations: 44

Session details: Keynote1

Proceedings of the 2018 International Conference on Management of Data Pub Date : 2018-05-27 DOI: 10.1145/3258003

P. Bernstein

Citations: 0

Session details: Research 6: Storage & Indexing

Proceedings of the 2018 International Conference on Management of Data Pub Date : 2018-05-27 DOI: 10.1145/3258010

K. A. Ross

Citations: 0

Session details: Research 13: Machine Learning & Knowledge-base Construction

Proceedings of the 2018 International Conference on Management of Data Pub Date : 2018-05-27 DOI: 10.1145/3258020

Guoliang Li

Citations: 0

Query-based Workload Forecasting for Self-Driving Database Management Systems

Proceedings of the 2018 International Conference on Management of Data Pub Date : 2018-05-27 DOI: 10.1145/3183713.3196908

Lin Ma, Dana Van Aken, Ahmed S. Hefny, Gustavo Mezerhane, Andrew Pavlo, Geoffrey J. Gordon

Abstract: The first step towards an autonomous database management system (DBMS) is the ability to model the target application's workload. This is necessary to allow the system to anticipate future workload needs and select the proper optimizations in a timely manner. Previous forecasting techniques model the resource utilization of the queries. Such metrics, however, change whenever the physical design of the database and the hardware resources change, thereby rendering previous forecasting models useless. We present a robust forecasting framework called QueryBot 5000 that allows a DBMS to predict the expected arrival rate of queries in the future based on historical data. To better support highly dynamic environments, our approach uses the logical composition of queries in the workload rather than the amount of physical resources used for query execution. It provides multiple horizons (short- vs. long-term) with different aggregation intervals. We also present a clustering-based technique for reducing the total number of forecasting models to maintain. To evaluate our approach, we compare our forecasting models against other state-of-the-art models on three real-world database traces. We implemented our models in an external controller for PostgreSQL and MySQL and demonstrate their effectiveness in selecting indexes.
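The idea of forecasting from the logical composition of queries, rather than from resource metrics, can be illustrated with a small sketch. It reduces each query to a template by replacing literals, counts arrivals per template in fixed-width time buckets, and predicts the next bucket with a moving average. Everything here is an assumption made for illustration: the regex-based templating, the function names, and the moving average are stand-ins and not the models or code of QueryBot 5000.

```python
import re
from collections import Counter, defaultdict


def to_template(sql):
    """Replace literals with '?' so queries differing only in constants match."""
    sql = re.sub(r"'[^']*'", "?", sql)            # string literals
    sql = re.sub(r"\b\d+(\.\d+)?\b", "?", sql)    # numeric literals
    return re.sub(r"\s+", " ", sql).strip().lower()


def arrival_rates(log, interval_s):
    """log: iterable of (timestamp_seconds, sql).

    Returns {template: Counter({bucket_index: arrivals})}.
    """
    rates = defaultdict(Counter)
    for ts, sql in log:
        rates[to_template(sql)][int(ts // interval_s)] += 1
    return rates


def moving_average_forecast(counts, last_bucket, window):
    """Predict arrivals in the next bucket as the mean of the last `window` buckets."""
    recent = [counts.get(last_bucket - i, 0) for i in range(window)]
    return sum(recent) / window
```

Calling `arrival_rates` with a larger `interval_s` gives a coarser series, which is one simple way to picture the different aggregation intervals the abstract mentions for short- and long-term horizons.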

Citations: 154

Session details: Industry 2: Real-time Analytics

Proceedings of the 2018 International Conference on Management of Data Pub Date : 2018-05-27 DOI: 10.1145/3258011

Barzan Mozafari

Citations: 0

DimBoost: Boosting Gradient Boosting Decision Tree to Higher Dimensions

Proceedings of the 2018 International Conference on Management of Data Pub Date : 2018-05-27 DOI: 10.1145/3183713.3196892

Jiawei Jiang, B. Cui, Ce Zhang, Fangcheng Fu

Abstract: Gradient boosting decision tree (GBDT) is one of the most popular machine learning models widely used in both academia and industry. Although GBDT has been widely supported by existing systems such as XGBoost, LightGBM, and MLlib, one system bottleneck appears when the dimensionality of the data becomes high. As a result, when we tried to support our industrial partner on datasets with dimensionality up to 330K, we observed suboptimal performance for all these aforementioned systems. In this paper, we ask "Can we build a scalable GBDT training system whose performance scales better with respect to dimensionality of the data?" The first contribution of this paper is a careful investigation of existing systems by developing a performance model with respect to the dimensionality of the data. We find that the collective communication operations in many existing systems only implement the algorithm designed for small messages. By just fixing this problem, we are able to speed up these systems by up to 2X. Our second contribution is a series of optimizations to further optimize the performance of collective communications. These optimizations include a task scheduler, a two-phase split finding method, and low-precision gradient histograms. Our third contribution is a sparsity-aware algorithm to build gradient histograms and a novel index structure to build histograms in parallel. We implement these optimizations in DimBoost and show that it can be 2-9X faster than existing systems.
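The abstract does not spell out how a sparsity-aware gradient histogram is built, so the following is only a sketch of one common way to do it, not DimBoost's implementation. It visits just the nonzero entries of a feature and assigns the gradient mass of all remaining rows to the bin that holds the value zero. The function name and parameters (`nonzeros`, `bin_of`, `zero_bin`, `total_grad`) are hypothetical.

```python
def build_histogram(nonzeros, grads, total_grad, bin_of, n_bins, zero_bin):
    """Gradient histogram for one feature, touching only its nonzero entries.

    nonzeros: list of (row, value) pairs for rows where the feature is nonzero.
    grads: per-row gradients, indexed by row.
    total_grad: sum of all gradients, computed once and shared across features.
    bin_of(value): maps a feature value to a bin index.
    zero_bin: index of the bin that contains the value 0.
    """
    hist = [0.0] * n_bins
    nonzero_sum = 0.0
    for row, value in nonzeros:
        hist[bin_of(value)] += grads[row]
        nonzero_sum += grads[row]
    # Rows not listed in `nonzeros` all have value 0, so their combined
    # gradient is whatever remains of the total.
    hist[zero_bin] += total_grad - nonzero_sum
    return hist
```

The cost per feature is proportional to the number of nonzero entries rather than the number of rows, which matters when most of a high-dimensional dataset is zeros.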

Citations: 34

Robust, Scalable, Real-Time Event Time Series Aggregation at Twitter

Proceedings of the 2018 International Conference on Management of Data Pub Date : 2018-05-27 DOI: 10.1145/3183713.3190663

Peilin Yang, S. Thiagarajan, Jimmy J. Lin

Citations: 5

POIsam: a System for Efficient Selection of Large-scale Geospatial Data on Maps

Proceedings of the 2018 International Conference on Management of Data Pub Date : 2018-05-27 DOI: 10.1145/3183713.3193549

Tao Guo, Mingzhao Li, Peishan Li, Z. Bao, G. Cong

Abstract: In this demonstration we present POIsam, a visualization system supporting the following desirable features: representativeness, visibility constraint, zooming consistency, and panning consistency. The first two constraints aim to efficiently select a small set of representative objects from the user's current region of interest, such that no two selected objects are too close to each other for users to distinguish in the limited space of a screen. One unique feature of POIsam is that any similarity metric can be plugged into POIsam to meet the user's specific needs in different scenarios. The latter two consistencies are fundamental challenges in efficiently updating the selection result w.r.t. the user's zoom-in, zoom-out, and panning operations when they interact with the map. POIsam drops a common assumption from all previous work, i.e. that the zoom levels and region cells are pre-defined and indexed, and that objects are selected from such region cells at a particular zoom level rather than from the user's current region of interest (which in most cases does not correspond to the pre-defined cells). This creates an extra challenge, as we need to do object selection via online computation. To the best of our knowledge, this is the first system that is able to meet all four features to achieve an interactive visualization map exploration system.
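The first two constraints in this abstract, representativeness and the visibility constraint, can be pictured with a short greedy sketch. It filters objects to the current viewport and then picks them in descending score order, skipping any object closer than a minimum distance to one already chosen. This is an illustration only, not POIsam's selection method: the per-object `score` is a hypothetical stand-in for a representativeness measure derived from a pluggable similarity metric, and the function and parameter names are assumptions.

```python
import math


def select_representatives(objects, region, min_dist, k):
    """Greedily pick up to k objects inside `region`, pairwise at least min_dist apart.

    objects: list of (name, x, y, score) tuples; score is a hypothetical
             stand-in for representativeness.
    region: (xmin, ymin, xmax, ymax) of the user's current view.
    """
    xmin, ymin, xmax, ymax = region
    visible = [o for o in objects
               if xmin <= o[1] <= xmax and ymin <= o[2] <= ymax]
    chosen = []
    for obj in sorted(visible, key=lambda o: o[3], reverse=True):
        # Visibility constraint: keep selected objects distinguishable on screen.
        if all(math.dist(obj[1:3], c[1:3]) >= min_dist for c in chosen):
            chosen.append(obj)
            if len(chosen) == k:
                break
    return chosen
```

Because the region is an argument rather than a pre-defined cell, the selection is recomputed for whatever viewport the user currently has, which mirrors the online computation the abstract describes. Keeping results consistent across zooming and panning would need additional logic that this sketch does not attempt.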

Citations: 4