Proceedings of the 29th International Conference on Scientific and Statistical Database Management: Latest Papers (Page 2)

Bi-Level Online Aggregation on Raw Data

Proceedings of the 29th International Conference on Scientific and Statistical Database Management Pub Date : 2017-06-27 DOI: 10.1145/3085504.3085514

Yu Cheng, Weijie Zhao, Florin Rusu

Abstract: In-situ processing has been proposed as a novel data exploration solution in many domains generating massive amounts of raw data, e.g., astronomy, since it provides immediate SQL querying over raw files. The performance of in-situ processing across a query workload is, however, limited by the speed of full scan, tokenizing, and parsing of the entire data. Online aggregation (OLA) has been introduced as an efficient method for data exploration that identifies uninteresting patterns faster by continuously estimating the result of a computation during the actual processing; the computation can be stopped as soon as the estimate is accurate enough to be deemed uninteresting. However, existing OLA solutions have a high upfront cost of randomly shuffling and/or sampling the data. In this paper, we present OLA-RAW, a bi-level sampling scheme for parallel online aggregation over raw data. Sampling in OLA-RAW is query-driven and performed exclusively in-situ during the runtime query execution, without data reorganization. This is realized by a novel resource-aware bi-level sampling algorithm that processes data in random chunks concurrently and determines adaptively the number of sampled tuples inside a chunk. In order to avoid the cost of repetitive conversion from raw data, OLA-RAW builds and maintains a memory-resident bi-level sample synopsis incrementally. We implement OLA-RAW inside a modern in-situ data processing system and evaluate its performance across several real and synthetic datasets and file formats. Our results show that OLA-RAW chooses the sampling plan that minimizes the execution time and guarantees the required accuracy for each query in a given workload. The end result is a focused data exploration process that avoids unnecessary work and discards uninteresting data.
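
To make the chunk-then-tuple idea in the abstract concrete, here is a minimal sketch of a generic two-stage sampling estimator for a SUM aggregate. It is not the OLA-RAW algorithm from the paper: the function name `bilevel_sum_estimate`, its parameters, and the fixed per-chunk sample size are assumptions for illustration (the abstract states that OLA-RAW chooses that number adaptively), and chunks are modeled as in-memory lists of numbers rather than raw files.

```python
import random


def bilevel_sum_estimate(chunks, n_chunks, tuples_per_chunk, seed=0):
    """Estimate the sum of all values using two-stage (chunk, then tuple) sampling.

    chunks           -- list of lists of numbers (hypothetical stand-in for raw file chunks)
    n_chunks         -- how many chunks to sample at level 1 (1 <= n_chunks <= len(chunks))
    tuples_per_chunk -- how many tuples to sample inside each chosen chunk (>= 1)
    """
    rng = random.Random(seed)
    # Level 1: pick chunks uniformly at random, without replacement.
    sampled_chunks = rng.sample(chunks, n_chunks)

    scaled_total = 0.0
    for chunk in sampled_chunks:
        if not chunk:
            continue  # an empty chunk contributes nothing to the sum
        m = min(tuples_per_chunk, len(chunk))
        # Level 2: pick tuples uniformly at random inside the chunk.
        sample = rng.sample(chunk, m)
        # Scale the sample sum up to an estimate of the whole chunk's sum.
        scaled_total += len(chunk) / m * sum(sample)

    # Scale the sampled chunks up to an estimate over all chunks.
    return len(chunks) / n_chunks * scaled_total
```

Sampling every chunk and every tuple reduces the estimate to the exact sum, which is a convenient sanity check for the scaling factors.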

Citations: 13

DualDB: An Efficient LSM-based Publish/Subscribe Storage System

Proceedings of the 29th International Conference on Scientific and Statistical Database Management Pub Date : 2017-06-27 DOI: 10.1145/3085504.3085528

Mohiuddin Abdul Qader, Vagelis Hristidis

Abstract: Publish/Subscribe systems allow subscribers to monitor for events of interest generated by publishers. Current publish/subscribe query systems are efficient when the subscriptions (queries) are relatively static (for instance, the set of followers in Twitter) or can fit in memory. However, an increasing number of applications in this era of Big Data and Internet of Things (IoT) are based on a highly dynamic query paradigm, where continuous queries are in the millions and are created and expire at a rate comparable to, or even higher than, that of the data (event) entries. For instance, moving objects like airplanes, cars or sensors may continuously generate measurement data like air pressure or traffic, which are consumed by other moving objects. In this paper we propose and compare a novel publish/subscribe storage architecture, DualDB, based on the popular NoSQL Log-Structured Merge Tree (LSM) storage paradigm, to support high-throughput and dynamic publish/subscribe systems. Our method naturally supports queries on both past and future data, and generates instant notifications, which are desirable properties missing from many previous systems. We implemented and experimentally evaluated our methods on the popular LSM-based LevelDB system, using real datasets. Our results show that we can achieve significantly higher throughput compared to state-of-the-art baselines.

Citations: 6

SHRec: Scalable Holistic Recommendation

Proceedings of the 29th International Conference on Scientific and Statistical Database Management Pub Date : 2017-06-27 DOI: 10.1145/3085504.3085523

Ahmed M. Aly, M. Hammad, Amr Ahmed

Abstract: The problem of recommending items to users is of high practical importance. For instance, many web services try to find relevant recommendations for the users, e.g., finding relevant movies, social-media friends, restaurants, shopping items, etc. The expansion of the Web and the ever-growing number of people who use web services render the problem of recommendation challenging. Locality Sensitive Hashing (LSH, for short) is the best-known scalable technique for nearest-neighbor search in high dimensional data, and hence LSH is widely used in most industrial recommendation systems. This paper presents an implementation of LSH using Google's MapReduce engine. We apply LSH to a real case study at Google, where we recommend for each web-host a set of outlinks based on the outlink similarity amongst the web-hosts. We identify some performance limitations of LSH that occur due to specific properties in the data, and that become significant when the scale of the data is large. Furthermore, we present SHRec, a novel technique for scalable recommendation that addresses these performance limitations. Based on real deployment of both SHRec and LSH on Google's infrastructure, and using real data of the crawled Web at Google, where a sample host-level graph of 1.5 billion web-hosts is extracted, we demonstrate that SHRec is more scalable than LSH. In particular, we show that SHRec is one order of magnitude faster than LSH while achieving better recommendation quality.
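
For readers unfamiliar with LSH, the sketch below shows one common variant, MinHash with banding, applied to sets of outlinks. This is an assumption for illustration only: the abstract does not say which LSH family the paper uses, and nothing here reflects SHRec or the MapReduce implementation. The names `minhash_signature` and `lsh_candidates`, the band and row counts, and the use of BLAKE2b as the hash are all hypothetical choices; input sets are assumed to be non-empty.

```python
import hashlib
from collections import defaultdict


def minhash_signature(items, num_hashes):
    """Return a MinHash signature: one minimum hash value per salted hash function."""
    signature = []
    for i in range(num_hashes):
        # Salting the input with the index i simulates num_hashes independent hashes.
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{i}:{item}".encode(), digest_size=8).digest(),
                "big",
            )
            for item in items
        ))
    return signature


def lsh_candidates(sets, bands=8, rows=4):
    """Return pairs of keys whose signatures agree on at least one whole band."""
    buckets = defaultdict(set)
    for name, items in sets.items():
        signature = minhash_signature(items, bands * rows)
        for b in range(bands):
            band = tuple(signature[b * rows:(b + 1) * rows])
            buckets[(b, band)].add(name)

    pairs = set()
    for members in buckets.values():
        ordered = sorted(members)
        for i in range(len(ordered)):
            for j in range(i + 1, len(ordered)):
                pairs.add((ordered[i], ordered[j]))
    return pairs
```

Hosts with identical outlink sets always land in the same buckets, while hosts with little overlap are unlikely to share any band; raising `rows` makes matching stricter and raising `bands` makes it more permissive.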

Citations: 0

Data Series Similarity Using Correlation-Aware Measures

Proceedings of the 29th International Conference on Scientific and Statistical Database Management Pub Date : 2017-06-27 DOI: 10.1145/3085504.3085515

Katsiaryna Mirylenka, Michele Dallachiesa, Themis Palpanas

Abstract: The increased availability of unprecedented amounts of sequential data (generated by Internet-of-Things, as well as scientific applications) has led in the past few years to renewed interest in, and attention to, the field of data series processing and analysis. Data series collections are processed and analyzed using a large variety of techniques, most of which are based on the computation of some distance function. In this study, we revisit this basic operation of data series distance calculation. We observe that the popular distance measures are oblivious to the correlations inherent in neighboring values in a data series. Therefore, we evaluate the plausibility and benefit of incorporating into the distance function measures of correlation, which enable us to capture the associations among neighboring values in the sequence. We propose four such measures, inspired by statistical and probabilistic approaches, which can effectively model these correlations. We analytically and experimentally demonstrate the benefits of the new measures using the 1NN classification task, and discuss the lessons learned. Finally, we propose future research directions for enabling the proposed measures to be used in practice.

Citations: 13

Discovering Partial Periodic Itemsets in Temporal Databases

Proceedings of the 29th International Conference on Scientific and Statistical Database Management Pub Date : 2017-06-27 DOI: 10.1145/3085504.3085535

R. U. Kiran, Haichuan Shang, Masashi Toyoda, M. Kitsuregawa

Abstract: A temporal database is a collection of transactions, ordered by their timestamps. Discovering partial periodic itemsets in temporal databases has numerous applications. However, to the best of our knowledge, no work has considered finding these itemsets in temporal databases, even though this type of data is very common in real life. Discovering partial periodic itemsets in temporal databases is challenging. It requires defining (i) an appropriate measure to assess the periodic interestingness of itemsets, and (ii) an algorithm to efficiently find all partial periodic itemsets. While a pattern-growth algorithm can be employed for the second sub-task, the first sub-task has not been addressed. Moreover, how these two tasks are combined has significant implications. In this paper, we address this challenge. We introduce a model to find partial periodic itemsets in temporal databases. A new measure, called periodic-frequency, is proposed to determine the periodic interestingness of itemsets by taking into account their number of cyclic repetitions in the entire data. Moreover, the paper introduces a pattern-growth algorithm to discover all partial periodic itemsets. Experimental results demonstrate that our model is efficient.

Citations: 24

How the Passengers Flow in Complex Metro Networks?

Proceedings of the 29th International Conference on Scientific and Statistical Database Management Pub Date : 2017-06-27 DOI: 10.1145/3085504.3085527

Guandong Sun, Yun Xiong, Yangyong Zhu

Citations: 5

Active Learning with Density-Initialized Decision Tree for Record Matching

Proceedings of the 29th International Conference on Scientific and Statistical Database Management Pub Date : 2017-06-27 DOI: 10.1145/3085504.3085518

Chenxiao Dou, Daniel W. Sun, Guoqiang Li, R. Wong

Citations: 4

Tiling Strategies for Distributed Point Cloud Databases

Proceedings of the 29th International Conference on Scientific and Statistical Database Management Pub Date : 2017-06-27 DOI: 10.1145/3085504.3085537

J. Szalai-Gindl, L. Dobos, I. Csabai

Citations: 2

PLI: Augmenting Live Databases with Custom Clustered Indexes

Proceedings of the 29th International Conference on Scientific and Statistical Database Management Pub Date : 2017-06-27 DOI: 10.1145/3085504.3085582

J. Wagner, A. Rasin, Dai Hai Ton That, T. Malik

Citations: 4

Detecting Global Hyperparaboloid Correlated Clusters Based on Hough Transform

Proceedings of the 29th International Conference on Scientific and Statistical Database Management Pub Date : 2017-06-27 DOI: 10.1145/3085504.3085536

Daniyal Kazempour, Markus Mauder, Peer Kröger, T. Seidl

Citations: 6