Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data: Latest Articles, Page 2

Scalable big graph processing in MapReduce

Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data Pub Date : 2014-06-18 DOI: 10.1145/2588555.2593661

Lu Qin, J. Yu, Lijun Chang, Hong Cheng, Chengqi Zhang, Xuemin Lin

Abstract: MapReduce has become one of the most popular parallel computing paradigms in the cloud, due to its high scalability, reliability, and fault tolerance achieved for a large variety of applications in big data processing. In the literature, the MapReduce Class (MRC) and the Minimal MapReduce Class (MMC) define the memory consumption, communication cost, CPU cost, and number of MapReduce rounds for an algorithm to execute in MapReduce. However, neither of them is designed for big graph processing in MapReduce, since the constraints in MMC can hardly be achieved simultaneously on graphs and the conditions in MRC may induce scalability problems when processing big graph data. In this paper, we study scalable big graph processing in MapReduce. We introduce a Scalable Graph processing Class (SGC) by relaxing some constraints in MMC to make it suitable for scalable graph processing. We define two graph join operators in SGC, namely, EN join and NE join, using which a wide range of graph algorithms can be designed, including PageRank, breadth first search, graph keyword search, Connected Component (CC) computation, and Minimum Spanning Forest (MSF) computation. Remarkably, to the best of our knowledge, for the two fundamental graph problems CC and MSF computation, this is the first work that can achieve $O(\log n)$ MapReduce rounds with $O(n+m)$ total communication cost in each round and constant memory consumption on each machine, where $n$ and $m$ are the number of nodes and edges in the graph respectively. We conducted extensive performance studies using two web-scale graphs, Twitter and Friendster, with different graph characteristics. The experimental results demonstrate that our algorithms can achieve high scalability in big graph processing.
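As a rough illustration of how a graph computation is phrased as repeated MapReduce rounds, the following Python sketch computes connected components by minimum-label propagation. It is a simple baseline and not the algorithm from the paper: it may need as many rounds as the graph's diameter rather than $O(\log n)$, and it runs in a single process. The function names `map_phase`, `reduce_phase`, and `connected_components` are hypothetical.

```python
from collections import defaultdict


def map_phase(edges, labels):
    """Emit (node, candidate_label) pairs: a node's own label and its neighbours' labels."""
    for node, label in labels.items():
        yield node, label
    for u, v in edges:
        yield u, labels[v]
        yield v, labels[u]


def reduce_phase(pairs):
    """Group candidate labels by node and keep the smallest one."""
    grouped = defaultdict(list)
    for node, label in pairs:
        grouped[node].append(label)
    return {node: min(cands) for node, cands in grouped.items()}


def connected_components(nodes, edges):
    """Repeat map/reduce rounds until no label changes; return (labels, rounds)."""
    labels = {n: n for n in nodes}  # every node starts as its own component
    rounds = 0
    while True:
        new_labels = reduce_phase(map_phase(edges, labels))
        rounds += 1
        if new_labels == labels:  # fixed point: labels identify components
            return labels, rounds
        labels = new_labels


labels, rounds = connected_components([1, 2, 3, 4, 5], [(1, 2), (2, 3), (4, 5)])
```

Nodes that end with the same label belong to the same component. The round counter includes the final round that only confirms nothing changed.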

Citations: 79

OASSIS: query driven crowd mining

Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data Pub Date : 2014-06-18 DOI: 10.1145/2588555.2610514

Yael Amsterdamer, S. Davidson, T. Milo, Slava Novgorodov, Amit Somech

Citations: 35

Answering top-k representative queries on graph databases

Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data Pub Date : 2014-06-18 DOI: 10.1145/2588555.2610524

Sayan Ranu, Minh X. Hoang, Ambuj K. Singh

Abstract: Given a function that classifies a data object as relevant or irrelevant, we consider the task of selecting k objects that best represent all relevant objects in the underlying database. This problem occurs naturally when analysts want to familiarize themselves with the relevant objects in a database using a small set of k exemplars. In this paper, we solve the problem of top-k representative queries on graph databases. While graph databases model a wide range of scientific data, solving the problem in the context of graphs presents us with unique challenges due to the inherent complexity of matching structures. Furthermore, top-k representative queries map to the classic Set Cover problem, making it NP-hard. To overcome these challenges, we develop a greedy approximation with theoretical guarantees on the quality of the answer set, noting that a better approximation is not feasible in polynomial time. To further optimize the quadratic computational cost of the greedy algorithm, we propose an index structure called NB-Index to index the theta-neighborhoods of the database graphs by employing a novel combination of Lipschitz embedding and agglomerative clustering. Extensive experiments on real graph datasets validate the efficiency and effectiveness of the proposed techniques that achieve up to two orders of magnitude speed-up over state-of-the-art algorithms.
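The greedy strategy mentioned in this abstract can be sketched in its textbook form as greedy maximum coverage. The Python below is a minimal sketch under assumptions, not the paper's implementation: the input `neighborhoods` (a mapping from each candidate to the set of relevant objects it represents) and the function name `greedy_representatives` are hypothetical, and the NB-Index is not modeled.

```python
def greedy_representatives(neighborhoods, k):
    """Pick up to k candidates whose neighbourhoods jointly cover the most objects.

    neighborhoods: dict mapping a candidate id to the set of relevant object
    ids it represents (hypothetical input for illustration).
    """
    covered = set()
    chosen = []
    for _ in range(k):
        best, best_gain = None, 0
        # Each pick rescans every remaining candidate for its marginal gain.
        for cand, nbrs in neighborhoods.items():
            if cand in chosen:
                continue
            gain = len(nbrs - covered)
            if gain > best_gain:
                best, best_gain = cand, gain
        if best is None:  # no remaining candidate adds coverage
            break
        chosen.append(best)
        covered |= neighborhoods[best]
    return chosen, covered


example = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6, 7}, "d": {1}}
chosen, covered = greedy_representatives(example, k=2)
```

With this toy input the first pick is `"c"` (four new objects) and the second is `"a"` (three new objects). Ties are broken by dictionary insertion order.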

Citations: 33

HYDRA: large-scale social identity linkage via heterogeneous behavior modeling

Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data Pub Date : 2014-06-18 DOI: 10.1145/2588555.2588559

Siyuan Liu, Shuhui Wang, Feida Zhu, Jinbo Zhang, R. Krishnan

Abstract: We study the problem of large-scale social identity linkage across different social media platforms, which is of critical importance to business intelligence by gaining from social data a deeper understanding and more accurate profiling of users. This paper proposes HYDRA, a solution framework which consists of three key steps: (I) modeling heterogeneous behavior by long-term behavior distribution analysis and multi-resolution temporal information matching; (II) constructing a structural consistency graph to measure the high-order structure consistency on users' core social structures across different platforms; and (III) learning the mapping function by multi-objective optimization composed of both the supervised learning on pair-wise ID linkage information and the cross-platform structure consistency maximization. Extensive experiments on 10 million users across seven popular social network platforms demonstrate that HYDRA correctly identifies real user linkage across different platforms, outperforms existing state-of-the-art algorithms by at least 20% under different settings, and is 4 times better in most settings.

Citations: 249

The analytical bootstrap: a new method for fast error estimation in approximate query processing

Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data Pub Date : 2014-06-18 DOI: 10.1145/2588555.2588579

Kai Zeng, Shi Gao, Barzan Mozafari, C. Zaniolo

Abstract: Sampling is one of the most commonly used techniques in Approximate Query Processing (AQP), an area of research that is now made more critical by the need for timely and cost-effective analytics over "Big Data". Assessing the quality (i.e., estimating the error) of approximate answers is essential for meaningful AQP, and the two main approaches used in the past to address this problem are based on either (i) analytic error quantification or (ii) the bootstrap method. The first approach is extremely efficient but lacks generality, whereas the second is quite general but suffers from its high computational overhead. In this paper, we introduce a probabilistic relational model for the bootstrap process, along with rigorous semantics and a unified error model, which bridges the gap between these two traditional approaches. Based on our probabilistic framework, we develop efficient algorithms to predict the distribution of the approximation results. These enable the computation of any bootstrap-based quality measure for a large class of SQL queries via a single-round evaluation of a slightly modified query. Extensive experiments on both synthetic and real-world datasets show that our method has superior prediction accuracy for bootstrap-based quality measures, and is several orders of magnitude faster than bootstrap.
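To make the cost of approach (ii) concrete, here is a minimal Python sketch of the classic resampling bootstrap for the standard error of an aggregate. It illustrates the baseline whose overhead the abstract describes, not the analytical bootstrap itself. The function name `bootstrap_std_error` and its parameters are hypothetical.

```python
import random
import statistics


def bootstrap_std_error(sample, estimator, n_resamples=1000, seed=0):
    """Estimate the standard error of `estimator` by resampling with replacement."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(sample)
    estimates = []
    for _ in range(n_resamples):
        # One bootstrap replicate: n draws with replacement from the sample.
        resample = [sample[rng.randrange(n)] for _ in range(n)]
        estimates.append(estimator(resample))
    # The spread of the replicate estimates approximates the estimator's error.
    return statistics.stdev(estimates)


sample = list(range(100))
se = bootstrap_std_error(sample, statistics.mean)
```

The estimator is re-evaluated once per resample, so the cost grows with `n_resamples` times the cost of one evaluation. For a mean, the result should land near the closed-form standard error (sample standard deviation divided by the square root of the sample size).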

Citations: 91

Versatile optimization of UDF-heavy data flows with sofa

Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data Pub Date : 2014-06-18 DOI: 10.1145/2588555.2594517

Astrid Rheinländer, M. Beckmann, Anja Kunkel, Arvid Heise, T. Stoltmann, U. Leser

Abstract: Currently, we witness an increased interest in large-scale analytical data flows on non-relational data. The predominant building blocks of such data flows are user-defined functions (UDFs), a fact that is not well taken into account for data flow language design and optimization in current systems. In this demonstration, we present Meteor, a declarative data flow language, and Sofa, a logical optimizer for UDF-heavy data flows, which are both part of the Stratosphere system. Meteor queries seamlessly combine self-descriptive, domain-specific operators with standard relational operators. Such queries are optimized by Sofa, building on a concise set of UDF annotations and a small set of rewrite rules to enable semantically equivalent plan rewriting of UDF-heavy data flows. A salient feature of Meteor and Sofa is extensibility: User-defined operators and their properties are arranged into a subsumption hierarchy, which considerably eases integration and optimization of new operators. In this demonstration, we will let users pose arbitrary Meteor queries and graphically showcase versatility and extensibility of Sofa during query optimization.

Citations: 8

Morsel-driven parallelism: a NUMA-aware query evaluation framework for the many-core age

Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data Pub Date : 2014-06-18 DOI: 10.1145/2588555.2610507

Viktor Leis, P. Boncz, A. Kemper, Thomas Neumann

Abstract: With modern computer architecture evolving, two problems conspire against the state-of-the-art approaches in parallel query execution: (i) to take advantage of many-cores, all query work must be distributed evenly among (soon) hundreds of threads in order to achieve good speedup, yet (ii) dividing the work evenly is difficult even with accurate data statistics due to the complexity of modern out-of-order cores. As a result, the existing approaches for plan-driven parallelism run into load balancing and context-switching bottlenecks, and therefore no longer scale. A third problem faced by many-core architectures is the decentralization of memory controllers, which leads to Non-Uniform Memory Access (NUMA). In response, we present the morsel-driven query execution framework, where scheduling becomes a fine-grained run-time task that is NUMA-aware. Morsel-driven query processing takes small fragments of input data (morsels) and schedules these to worker threads that run entire operator pipelines until the next pipeline breaker. The degree of parallelism is not baked into the plan but can elastically change during query execution, so the dispatcher can react to execution speed of different morsels but also adjust resources dynamically in response to newly arriving queries in the workload. Further, the dispatcher is aware of data locality of the NUMA-local morsels and operator state, such that the great majority of executions takes place on NUMA-local memory. Our evaluation on the TPC-H and SSB benchmarks shows extremely high absolute performance and an average speedup of over 30 with 32 cores.
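The scheduling idea in this abstract (workers repeatedly pulling small input fragments and running a whole pipeline on each) can be sketched as follows. This is a minimal Python illustration under assumptions, not the system described in the paper: it has no NUMA awareness, Python threads share the interpreter lock so it shows the control flow rather than real speedup, and the names `run_morsel_driven`, `pipeline`, and `combine` are hypothetical.

```python
import queue
import threading


def run_morsel_driven(data, pipeline, combine, n_workers=4, morsel_size=1000):
    """Split `data` into morsels and let workers pull them until none remain."""
    morsels = queue.Queue()
    for start in range(0, len(data), morsel_size):
        morsels.put(data[start:start + morsel_size])

    partials = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                morsel = morsels.get_nowait()  # fast workers simply pull more
            except queue.Empty:
                return  # no work left for this worker
            result = pipeline(morsel)  # run the whole pipeline on one morsel
            with lock:
                partials.append(result)

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return combine(partials)  # stands in for the pipeline breaker


data = list(range(10000))
total = run_morsel_driven(data, lambda m: sum(x for x in m if x % 2 == 0), sum)
```

Because work is claimed one morsel at a time, the split between workers is decided at run time rather than fixed in advance, which is the load-balancing property the abstract emphasizes.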

Citations: 255

Session details: Research session 5: data analytics

Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data Pub Date : 2014-06-18 DOI: 10.1145/3255752

G. Vossen

Citations: 0

CrowdMatcher: crowd-assisted schema matching

Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data Pub Date : 2014-06-18 DOI: 10.1145/2588555.2594515

C. Zhang, Ziyuan Zhao, Lei Chen, H. Jagadish, Caleb Chen Cao

Abstract: Schema matching is a central challenge for data integration systems. Due to the inherent uncertainty arising from the inability of schemas to fully capture the semantics of the represented data, automatic tools are often uncertain about suggested matching results. However, humans are good at understanding data represented in various forms, and crowdsourcing platforms are making the human annotation process more affordable. Thus, in this demo, we show how to utilize the crowd to find the right matching. To do that, we need to make the tasks posted on the crowdsourcing platforms extremely simple, so that they can be performed by non-experts, and to reduce the number of tasks as much as possible to save cost. We demonstrate CrowdMatcher, a hybrid machine-crowd system for schema matching. The machine-generated matchings are verified by correspondence correctness queries (CCQs), which ask the crowd to determine whether a given correspondence is correct or not. CrowdMatcher includes several original features: it integrates different matchings generated from classical schema matching tools; in order to minimize the cost of crowdsourcing, it automatically selects the most informative set of CCQs from the possible matchings; it is able to manage inaccurate answers provided by the workers; and the crowdsourced answers are used to improve matching results.

Citations: 18

A formal approach to finding explanations for database queries

Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data Pub Date : 2014-06-18 DOI: 10.1145/2588555.2588578

Sudeepa Roy, Dan Suciu

Citations: 144