{"title":"Range Thresholding on Streams","authors":"Miao Qiao, Junhao Gan, Yufei Tao","doi":"10.1145/2882903.2915965","DOIUrl":"https://doi.org/10.1145/2882903.2915965","url":null,"abstract":"This paper studies a type of continuous queries called range thresholding on streams (RTS). Imagine the stream as an unbounded sequence of elements each of which is a real value. A query registers an interval, and must be notified as soon as a certain number of incoming elements fall into the interval. The system needs to support multiple queries simultaneously, and aims to minimize the space consumption and computation time. Currently, all the solutions to this problem entail quadratic time O(nm) to process n stream elements and m queries, which severely limits their applicability to only a small number of queries. We propose the first algorithm that breaks the quadratic barrier, by reducing the computation cost dramatically to O(n + m), subject only to a polylogarithmic factor. The algorithm is general enough to guarantee the same on weighted versions of the queries even in d-dimensional space of any constant d. Its vast advantage over the previous methods in practical environments has been confirmed through extensive experimentation.","PeriodicalId":20483,"journal":{"name":"Proceedings of the 2016 International Conference on Management of Data","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-06-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77677423","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Elastic Pipelining in an In-Memory Database Cluster","authors":"Li Wang, Minqi Zhou, Zhenjie Zhang, Y. Yang, Aoying Zhou, D. Bitton","doi":"10.1145/2882903.2882904","DOIUrl":"https://doi.org/10.1145/2882903.2882904","url":null,"abstract":"An in-memory database cluster consists of multiple interconnected nodes with a large capacity of RAM and modern multi-core CPUs. As a conventional query processing strategy, pipelining remains a promising solution for in-memory parallel database systems, as it avoids expensive intermediate result materialization and parallelizes the data processing among nodes. However, to fully unleash the power of pipelining in a cluster with multi-core nodes, it is crucial for the query optimizer to generate good query plans with appropriate intra-node parallelism, in order to maximize CPU and network bandwidth utilization. A suboptimal plan, on the contrary, causes load imbalance in the pipelines and consequently degrades the query performance. Parallelism assignment optimization at compile time is nearly impossible, as the workload in each node is affected by numerous factors and is highly dynamic during query evaluation. To tackle this problem, we propose elastic pipelining, which makes it possible to optimize intra-node parallelism assignments in the pipelines based on the actual workload at runtime. It is achieved with the adoption of new elastic iterator model and a fully optimized dynamic scheduler. The elastic iterator model generally upgrades traditional iterator model with new dynamic multi-core execution adjustment capability. And the dynamic scheduler efficiently provisions CPU cores to query execution segments in the pipelines based on the light-weight measurements on the operators. Extensive experiments on real and synthetic (TPC-H) data show that our proposal achieves almost full CPU utilization on typical decision-making analytical queries, outperforming state-of-the-art open-source systems by a huge margin.","PeriodicalId":20483,"journal":{"name":"Proceedings of the 2016 International Conference on Management of Data","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-06-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73388280","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dynamic Prefetching of Data Tiles for Interactive Visualization","authors":"L. Battle, Remco Chang, M. Stonebraker","doi":"10.1145/2882903.2882919","DOIUrl":"https://doi.org/10.1145/2882903.2882919","url":null,"abstract":"In this paper, we present ForeCache, a general-purpose tool for exploratory browsing of large datasets. ForeCache utilizes a client-server architecture, where the user interacts with a lightweight client-side interface to browse datasets, and the data to be browsed is retrieved from a DBMS running on a back-end server. We assume a detail-on-demand browsing paradigm, and optimize the back-end support for this paradigm by inserting a separate middleware layer in front of the DBMS. To improve response times, the middleware layer fetches data ahead of the user as she explores a dataset. We consider two different mechanisms for prefetching: (a) learning what to fetch from the user's recent movements, and (b) using data characteristics (e.g., histograms) to find data similar to what the user has viewed in the past. We incorporate these mechanisms into a single prediction engine that adjusts its prediction strategies over time, based on changes in the user's behavior. We evaluated our prediction engine with a user study, and found that our dynamic prefetching strategy provides: (1) significant improvements in overall latency when compared with non-prefetching systems (430% improvement); and (2) substantial improvements in both prediction accuracy (25% improvement) and latency (88% improvement) relative to existing prefetching techniques.","PeriodicalId":20483,"journal":{"name":"Proceedings of the 2016 International Conference on Management of Data","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-06-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82161385","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Publishing Graph Degree Distribution with Node Differential Privacy","authors":"Wei-Yen Day, Ninghui Li, Min Lyu","doi":"10.1145/2882903.2926745","DOIUrl":"https://doi.org/10.1145/2882903.2926745","url":null,"abstract":"Graph data publishing under node-differential privacy (node-DP) is challenging due to the huge sensitivity of queries. However, since a node in graph data oftentimes represents a person, node-DP is necessary to achieve personal data protection. In this paper, we investigate the problem of publishing the degree distribution of a graph under node-DP by exploring the projection approach to reduce the sensitivity. We propose two approaches based on aggregation and cumulative histogram to publish the degree distribution. The experiments demonstrate that our approaches greatly reduce the error of approximating the true degree distribution and have significant improvement over existing works. We also present the introspective analysis for understanding the factors of publishing the degree distribution with node-DP.","PeriodicalId":20483,"journal":{"name":"Proceedings of the 2016 International Conference on Management of Data","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-06-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90343938","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GTS: A Fast and Scalable Graph Processing Method based on Streaming Topology to GPUs","authors":"Min-Soo Kim, K. An, Himchan Park, Hyunseok Seo, Jinwook Kim","doi":"10.1145/2882903.2915204","DOIUrl":"https://doi.org/10.1145/2882903.2915204","url":null,"abstract":"A fast and scalable graph processing method becomes increasingly important as graphs become popular in a wide range of applications and their sizes are growing rapidly. Most of distributed graph processing methods require a lot of machines equipped with a total of thousands of CPU cores and a few terabyte main memory for handling billion-scale graphs. Meanwhile, GPUs could be a promising direction toward fast processing of large-scale graphs by exploiting thousands of GPU cores. All of the existing methods using GPUs, however, fail to process large-scale graphs that do not fit in main memory of a single machine. Here, we propose a fast and scalable graph processing method GTS that handles even RMAT32 (64 billion edges) very efficiently only by using a single machine. The proposed method stores graphs in PCI-E SSDs and executes a graph algorithm using thousands of GPU cores while streaming topology data of graphs to GPUs via PCI-E interface. GTS is fast due to no communication overhead and scalable due to no data duplication from graph partitioning among machines. Through extensive experiments, we show that GTS consistently and significantly outperforms the major distributed graph processing methods, GraphX, Giraph, and PowerGraph, and the state-of-the-art GPU-based method TOTEM.","PeriodicalId":20483,"journal":{"name":"Proceedings of the 2016 International Conference on Management of Data","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-06-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89355862","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fast Multi-Column Sorting in Main-Memory Column-Stores","authors":"Wenjian Xu, Ziqiang Feng, Eric Lo","doi":"10.1145/2882903.2915205","DOIUrl":"https://doi.org/10.1145/2882903.2915205","url":null,"abstract":"Sorting is a crucial operation that could be used to implement SQL operators such as GROUP BY, ORDER BY, and SQL:2003 PARTITION BY. Queries with multiple attributes in those clauses are common in real workloads. When executing queries of that kind, state-of-the-art main-memory column-stores require one round of sorting per input column. With the advent of recent fast scans and denormalization techniques, that kind of multi-column sorting could become a bottleneck. In this paper, we propose a new technique called \"code massaging\", which manipulates the bits across the columns so that the overall sorting time can be reduced by eliminating some rounds of sorting and/or by improving the degree of SIMD data level parallelism. Empirical results show that a main-memory column-store with code massaging can achieve speedup of up to 4.7X, 4.7X, 4X, and 3.2X on TPC-H, TPC-H skew, TPC-DS, and real workload, respectively.","PeriodicalId":20483,"journal":{"name":"Proceedings of the 2016 International Conference on Management of Data","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-06-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87386717","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hybrid Pulling/Pushing for I/O-Efficient Distributed and Iterative Graph Computing","authors":"Zhigang Wang, Yu Gu, Y. Bao, Ge Yu, J. Yu","doi":"10.1145/2882903.2882938","DOIUrl":"https://doi.org/10.1145/2882903.2882938","url":null,"abstract":"Billion-node graphs are rapidly growing in size in many applications such as online social networks. Most graph algorithms generate a large number of messages during iterative computations. Vertex-centric distributed systems usually store graph data and message data on disk to improve scalability. Currently, these distributed systems with disk-resident data take a push-based approach to handle messages. This works well if few messages reside on disk. Otherwise, it is I/O-inefficient due to expensive random writes. By contrast, the existing memory-resident pull-based approach individually pulls messages for each vertex on demand. Although it can be used to avoid disk operations regarding messages, expensive I/O costs are incurred by random and frequent access to vertices. This paper proposes a hybrid solution to support switching between push and pull adaptively, to obtain optimal performance for distributed systems with disk-resident data in different scenarios. We first employ a new block-centric technique (b-pull) to improve the I/O-performance of pulling messages, although the iterative computation is vertex-centric. I/O costs of data accesses are shifted from the receiver side where messages are written/read by push to the sender side where graph data are read by b-pull. Graph data are organized by clustering vertices and edges to achieve high I/O-efficiency in b-pull. Second, we design a seamless switching mechanism and a prominent performance prediction method to guarantee efficiency when switching between push and b-pull. We conduct extensive performance studies to confirm the effectiveness of our proposals over existing up-to-date solutions using a broad spectrum of real-world graphs.","PeriodicalId":20483,"journal":{"name":"Proceedings of the 2016 International Conference on Management of Data","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-06-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90285807","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"AT-GIS: Highly Parallel Spatial Query Processing with Associative Transducers","authors":"Peter Ogden, David B. Thomas, P. Pietzuch","doi":"10.1145/2882903.2882962","DOIUrl":"https://doi.org/10.1145/2882903.2882962","url":null,"abstract":"Users in many domains, including urban planning, transportation, and environmental science want to execute analytical queries over continuously updated spatial datasets. Current solutions for large-scale spatial query processing either rely on extensions to RDBMS, which entails expensive loading and indexing phases when the data changes, or distributed map/reduce frameworks, running on resource-hungry compute clusters. Both solutions struggle with the sequential bottleneck of parsing complex, hierarchical spatial data formats, which frequently dominates query execution time. Our goal is to fully exploit the parallelism offered by modern multi-core CPUs for parsing and query execution, thus providing the performance of a cluster with the resources of a single machine. We describe AT-GIS, a highly-parallel spatial query processing system that scales linearly to a large number of CPU cores. AT-GIS integrates the parsing and querying of spatial data using a new computational abstraction called associative transducers (ATs). ATs can form a single data-parallel pipeline for computation without requiring the spatial input data to be split into logically independent blocks. Using ATs, AT-GIS can execute, in parallel, spatial query operators on the raw input data in multiple formats, without any pre-processing. On a single 64-core machine, AT-GIT provides 3x the performance of an 8-node Hadoop cluster with 192 cores for containment queries, and 10x for aggregation queries.","PeriodicalId":20483,"journal":{"name":"Proceedings of the 2016 International Conference on Management of Data","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-06-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75667756","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Shasta: Interactive Reporting At Scale","authors":"G. Manoharan, Stephan Ellner, Karl Schnaitter, Sridatta Chegu, Alejandro Estrella-Balderrama, Stephan Gudmundson, Apurv Gupta, B. Handy, Bart Samwel, Chad Whipkey, Larysa Aharkava, Himani Apte, Nitin Gangahar, Jun Xu, S. Venkataraman, D. Agrawal, J. Ullman","doi":"10.1145/2882903.2904444","DOIUrl":"https://doi.org/10.1145/2882903.2904444","url":null,"abstract":"We describe Shasta, a middleware system built at Google to support interactive reporting in complex user-facing applications related to Google's Internet advertising business. Shasta targets applications with challenging requirements: First, user query latencies must be low. Second, underlying transactional data stores have complex \"read-unfriendly\" schemas, placing significant transformation logic between stored data and the read-only views that Shasta exposes to its clients. This transformation logic must be expressed in a way that scales to large and agile engineering teams. Finally, Shasta targets applications with strong data freshness requirements, making it challenging to precompute query results using common techniques such as ETL pipelines or materialized views. Instead, online queries must go all the way from primary storage to user-facing views, resulting in complex queries joining 50 or more tables. Designed as a layer on top of Google's F1 RDBMS and Mesa data warehouse, Shasta combines language and system techniques to meet these requirements. To help with expressing complex view specifications, we developed a query language called RVL, with support for modularized view templates that can be dynamically compiled into SQL. To execute these SQL queries with low latency at scale, we leveraged and extended F1's distributed query engine with facilities such as safe execution of C++ and Java UDFs. To reduce latency and increase read parallelism, we extended F1 storage with a distributed read-only in-memory cache. The system we describe is in production at Google, powering critical applications used by advertisers and internal sales teams. Shasta has significantly improved system scalability and software engineering efficiency compared to the middleware solutions it replaced.","PeriodicalId":20483,"journal":{"name":"Proceedings of the 2016 International Conference on Management of Data","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-06-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82134824","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Goods: Organizing Google's Datasets","authors":"A. Halevy, Flip Korn, Natasha Noy, Christopher Olston, N. Polyzotis, Sudip Roy, Steven Euijong Whang","doi":"10.1145/2882903.2903730","DOIUrl":"https://doi.org/10.1145/2882903.2903730","url":null,"abstract":"Enterprises increasingly rely on structured datasets to run their businesses. These datasets take a variety of forms, such as structured files, databases, spreadsheets, or even services that provide access to the data. The datasets often reside in different storage systems, may vary in their formats, may change every day. In this paper, we present GOODS, a project to rethink how we organize structured datasets at scale, in a setting where teams use diverse and often idiosyncratic ways to produce the datasets and where there is no centralized system for storing and querying them. GOODS extracts metadata ranging from salient information about each dataset (owners, timestamps, schema) to relationships among datasets, such as similarity and provenance. It then exposes this metadata through services that allow engineers to find datasets within the company, to monitor datasets, to annotate them in order to enable others to use their datasets, and to analyze relationships between them. We discuss the technical challenges that we had to overcome in order to crawl and infer the metadata for billions of datasets, to maintain the consistency of our metadata catalog at scale, and to expose the metadata to users. We believe that many of the lessons that we learned are applicable to building large-scale enterprise-level data-management systems in general.","PeriodicalId":20483,"journal":{"name":"Proceedings of the 2016 International Conference on Management of Data","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-06-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83783440","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}