Proceedings of the 2016 International Conference on Management of Data最新文献_第10页

DUALSIM: Parallel Subgraph Enumeration in a Massive Graph on a Single Machine DUALSIM:单机上海量图的并行子图枚举

Proceedings of the 2016 International Conference on Management of Data Pub Date : 2016-06-14 DOI: 10.1145/2882903.2915209

Hyeonji Kim, Juneyoung Lee, S. Bhowmick, Wook-Shin Han, Jeong-Hoon Lee, Seongyun Ko, M. Jarrah

{"title":"DUALSIM: Parallel Subgraph Enumeration in a Massive Graph on a Single Machine","authors":"Hyeonji Kim, Juneyoung Lee, S. Bhowmick, Wook-Shin Han, Jeong-Hoon Lee, Seongyun Ko, M. Jarrah","doi":"10.1145/2882903.2915209","DOIUrl":"https://doi.org/10.1145/2882903.2915209","url":null,"abstract":"Subgraph enumeration is important for many applications such as subgraph frequencies, network motif discovery, graphlet kernel computation, and studying the evolution of social networks. Most earlier work on subgraph enumeration assumes that graphs are resident in memory, which results in serious scalability problems. Recently, efforts to enumerate all subgraphs in a large-scale graph have seemed to enjoy some success by partitioning the data graph and exploiting the distributed frameworks such as MapReduce and distributed graph engines. However, we notice that all existing distributed approaches have serious performance problems for subgraph enumeration due to the explosive number of partial results. In this paper, we design and implement a disk-based, single machine parallel subgraph enumeration solution called DualSim that can handle massive graphs without maintaining exponential numbers of partial results. Specifically, we propose a novel concept of the dual approach for subgraph enumeration. The dual approach swaps the roles of the data graph and the query graph. Specifically, instead of fixing the matching order in the query and then matching data vertices, it fixes the data vertices by fixing a set of disk pages and then finds all subgraph matchings in these pages. This enables us to significantly reduce the number of disk reads. We conduct extensive experiments with various real-world graphs to systematically demonstrate the superiority of DualSim over state-of-the-art distributed subgraph enumeration methods. DualSim outperforms the state-of-the-art methods by up to orders of magnitude, while they fail for many queries due to explosive intermediate results.","PeriodicalId":20483,"journal":{"name":"Proceedings of the 2016 International Conference on Management of Data","volume":"87 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-06-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82660389","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 66

Data Polygamy: The Many-Many Relationships among Urban Spatio-Temporal Data Sets 数据一夫多妻:城市时空数据集之间的多-多关系

Proceedings of the 2016 International Conference on Management of Data Pub Date : 2016-06-14 DOI: 10.1145/2882903.2915245

F. Chirigati, Harish Doraiswamy, T. Damoulas, J. Freire

{"title":"Data Polygamy: The Many-Many Relationships among Urban Spatio-Temporal Data Sets","authors":"F. Chirigati, Harish Doraiswamy, T. Damoulas, J. Freire","doi":"10.1145/2882903.2915245","DOIUrl":"https://doi.org/10.1145/2882903.2915245","url":null,"abstract":"The increasing ability to collect data from urban environments, coupled with a push towards openness by governments, has resulted in the availability of numerous spatio-temporal data sets covering diverse aspects of a city. Discovering relationships between these data sets can produce new insights by enabling domain experts to not only test but also generate hypotheses. However, discovering these relationships is difficult. First, a relationship between two data sets may occur only at certain locations and/or time periods. Second, the sheer number and size of the data sets, coupled with the diverse spatial and temporal scales at which the data is available, presents computational challenges on all fronts, from indexing and querying to analyzing them. Finally, it is non-trivial to differentiate between meaningful and spurious relationships. To address these challenges, we propose Data Polygamy, a scalable topology-based framework that allows users to query for statistically significant relationships between spatio-temporal data sets. We have performed an experimental evaluation using over 300 spatial-temporal urban data sets which shows that our approach is scalable and effective at identifying interesting relationships.","PeriodicalId":20483,"journal":{"name":"Proceedings of the 2016 International Conference on Management of Data","volume":"71 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-06-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80307507","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 64

Ambry: LinkedIn's Scalable Geo-Distributed Object Store Ambry: LinkedIn的可伸缩地理分布式对象存储

Proceedings of the 2016 International Conference on Management of Data Pub Date : 2016-06-14 DOI: 10.1145/2882903.2903738

S. Noghabi, Sriram Ganapathi Subramanian, Priyesh Narayanan, Sivabalan Narayanan, G. Holla, M. Zadeh, Tianwei Li, Indranil Gupta, R. Campbell

{"title":"Ambry: LinkedIn's Scalable Geo-Distributed Object Store","authors":"S. Noghabi, Sriram Ganapathi Subramanian, Priyesh Narayanan, Sivabalan Narayanan, G. Holla, M. Zadeh, Tianwei Li, Indranil Gupta, R. Campbell","doi":"10.1145/2882903.2903738","DOIUrl":"https://doi.org/10.1145/2882903.2903738","url":null,"abstract":"The infrastructure beneath a worldwide social network has to continually serve billions of variable-sized media objects such as photos, videos, and audio clips. These objects must be stored and served with low latency and high throughput by a system that is geo-distributed, highly scalable, and load-balanced. Existing file systems and object stores face several challenges when serving such large objects. We present Ambry, a production-quality system for storing large immutable data (called blobs). Ambry is designed in a decentralized way and leverages techniques such as logical blob grouping, asynchronous replication, rebalancing mechanisms, zero-cost failure detection, and OS caching. Ambry has been running in LinkedIn's production environment for the past 2 years, serving up to 10K requests per second across more than 400 million users. Our experimental evaluation reveals that Ambry offers high efficiency (utilizing up to 88% of the network bandwidth), low latency (less than 50 ms latency for a 1 MB object), and load balancing (improving imbalance of request rate among disks by 8x-10x).","PeriodicalId":20483,"journal":{"name":"Proceedings of the 2016 International Conference on Management of Data","volume":"11 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-06-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77796050","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 29

Estimating the Impact of Unknown Unknowns on Aggregate Query Results 估计未知未知数对聚合查询结果的影响

Proceedings of the 2016 International Conference on Management of Data Pub Date : 2016-06-14 DOI: 10.1145/2882903.2882909

Yeounoh Chung, Michael L. Mortensen, Carsten Binnig, Tim Kraska

引用次数: 5

Expressive Query Construction through Direct Manipulation of Nested Relational Results 通过直接操作嵌套关系结果构建表达性查询

Proceedings of the 2016 International Conference on Management of Data Pub Date : 2016-06-14 DOI: 10.1145/2882903.2915210

Eirik Bakke, David R Karger

{"title":"Expressive Query Construction through Direct Manipulation of Nested Relational Results","authors":"Eirik Bakke, David R Karger","doi":"10.1145/2882903.2915210","DOIUrl":"https://doi.org/10.1145/2882903.2915210","url":null,"abstract":"Despite extensive research on visual query systems, the standard way to interact with relational databases remains to be through SQL queries and tailored form interfaces. We consider three requirements to be essential to a successful alternative: (1) query specification through direct manipulation of results, (2) the ability to view and modify any part of the current query without departing from the direct manipulation interface, and (3) SQL-like expressiveness. This paper presents the first visual query system to meet all three requirements in a single design. By directly manipulating nested relational results, and using spreadsheet idioms such as formulas and filters, the user can express a relationally complete set of query operators plus calculation, aggregation, outer joins, sorting, and nesting, while always remaining able to track and modify the state of the complete query. Our prototype gives the user an experience of responsive, incremental query building while pushing all actual query processing to the database layer. We evaluate our system with formative and controlled user studies on 28 spreadsheet users; the controlled study shows our system significantly outperforming Microsoft Access on the System Usability Scale.","PeriodicalId":20483,"journal":{"name":"Proceedings of the 2016 International Conference on Management of Data","volume":"21 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-06-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78759289","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 37

Multi-Source Uncertain Entity Resolution at Yad Vashem: Transforming Holocaust Victim Reports into People 亚德瓦谢姆多来源不确定实体解决方案:将大屠杀受害者报告转化为人

Proceedings of the 2016 International Conference on Management of Data Pub Date : 2016-06-14 DOI: 10.1145/2882903.2903737

Tomer Sagi, A. Gal, Omer Barkol, Ruth Bergman, Alexander Avram

引用次数: 10

Augmented Sketch: Faster and More Accurate Stream Processing 增强草图:更快，更准确的流处理

Proceedings of the 2016 International Conference on Management of Data Pub Date : 2016-06-14 DOI: 10.1145/2882903.2882948

Pratanu Roy, Arijit Khan, G. Alonso

引用次数: 135

PrivateClean: Data Cleaning and Differential Privacy PrivateClean:数据清理和差异隐私

Proceedings of the 2016 International Conference on Management of Data Pub Date : 2016-06-14 DOI: 10.1145/2882903.2915248

S. Krishnan, Jiannan Wang, M. Franklin, Ken Goldberg, Tim Kraska

{"title":"PrivateClean: Data Cleaning and Differential Privacy","authors":"S. Krishnan, Jiannan Wang, M. Franklin, Ken Goldberg, Tim Kraska","doi":"10.1145/2882903.2915248","DOIUrl":"https://doi.org/10.1145/2882903.2915248","url":null,"abstract":"Recent advances in differential privacy make it possible to guarantee user privacy while preserving the main characteristics of the data. However, most differential privacy mechanisms assume that the underlying dataset is clean. This paper explores the link between data cleaning and differential privacy in a framework we call PrivateClean. PrivateClean includes a technique for creating private datasets of numerical and discrete-valued attributes, a formalism for privacy-preserving data cleaning, and techniques for answering sum, count, and avg queries after cleaning. We show: (1) how the degree of privacy affects subsequent aggregate query accuracy, (2) how privacy potentially amplifies certain types of errors in a dataset, and (3) how this analysis can be used to tune the degree of privacy. The key insight is to maintain a bipartite graph relating dirty values to clean values and use this graph to estimate biases due to the interaction between cleaning and privacy. We validate these results on four datasets with a variety of well-studied cleaning techniques including using functional dependencies, outlier filtering, and resolving inconsistent attributes.","PeriodicalId":20483,"journal":{"name":"Proceedings of the 2016 International Conference on Management of Data","volume":"32 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-06-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86123559","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 33

Efficient and Progressive Group Steiner Tree Search 高效进步的群斯坦纳树搜索

Proceedings of the 2016 International Conference on Management of Data Pub Date : 2016-06-14 DOI: 10.1145/2882903.2915217

Ronghua Li, Lu Qin, J. Yu, Rui Mao

{"title":"Efficient and Progressive Group Steiner Tree Search","authors":"Ronghua Li, Lu Qin, J. Yu, Rui Mao","doi":"10.1145/2882903.2915217","DOIUrl":"https://doi.org/10.1145/2882903.2915217","url":null,"abstract":"The Group Steiner Tree (GST) problem is a fundamental problem in database area that has been successfully applied to keyword search in relational databases and team search in social networks. The state-of-the-art algorithm for the GST problem is a parameterized dynamic programming (DP) algorithm, which finds the optimal tree in O(3kn+2k(n log n + m)) time, where k is the number of given groups, m and n are the number of the edges and nodes of the graph respectively. The major limitations of the parameterized DP algorithm are twofold: (i) it is intractable even for very small values of k (e.g., k=8) in large graphs due to its exponential complexity, and (ii) it cannot generate a solution until the algorithm has completed its entire execution. To overcome these limitations, we propose an efficient and progressive GST algorithm in this paper, called PrunedDP. It is based on newly-developed optimal-tree decomposition and conditional tree merging techniques. The proposed algorithm not only drastically reduces the search space of the parameterized DP algorithm, but it also produces progressively-refined feasible solutions during algorithm execution. To further speed up the PrunedDP algorithm, we propose a progressive A*-search algorithm, based on several carefully-designed lower-bounding techniques. We conduct extensive experiments to evaluate our algorithms on several large scale real-world graphs. The results show that our best algorithm is not only able to generate progressively-refined feasible solutions, but it also finds the optimal solution with at least two orders of magnitude acceleration over the state-of-the-art algorithm, using much less memory.","PeriodicalId":20483,"journal":{"name":"Proceedings of the 2016 International Conference on Management of Data","volume":"96 4","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-06-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91473943","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 47

Accelerating Relational Databases by Leveraging Remote Memory and RDMA 利用远程内存和RDMA加速关系数据库

Proceedings of the 2016 International Conference on Management of Data Pub Date : 2016-06-14 DOI: 10.1145/2882903.2882949

Feng Li, Sudipto Das, M. Syamala, Vivek R. Narasayya

{"title":"Accelerating Relational Databases by Leveraging Remote Memory and RDMA","authors":"Feng Li, Sudipto Das, M. Syamala, Vivek R. Narasayya","doi":"10.1145/2882903.2882949","DOIUrl":"https://doi.org/10.1145/2882903.2882949","url":null,"abstract":"Memory is a crucial resource in relational databases (RDBMSs). When there is insufficient memory, RDBMSs are forced to use slower media such as SSDs or HDDs, which can significantly degrade workload performance. Cloud database services are deployed in data centers where network adapters supporting remote direct memory access (RDMA) at low latency and high bandwidth are becoming prevalent. We study the novel problem of how a Symmetric Multi-Processing (SMP) RDBMS, whose memory demands exceed locally-available memory, can leverage available remote memory in the cluster accessed via RDMA to improve query performance. We expose available memory on remote servers using a lightweight file API that allows an SMP RDBMS to leverage the benefits of remote memory with modest changes. We identify and implement several novel scenarios to demonstrate these benefits, and address design challenges that are crucial for efficient implementation. We implemented the scenarios in Microsoft SQL Server engine and present the first end-to-end study to demonstrate benefits of remote memory for a variety of micro-benchmarks and industry-standard benchmarks. Compared to using disks when memory is insufficient, we improve the throughput and latency of queries with short reads and writes by 3X to 10X, while improving the latency of multiple TPC-H and TPC-DS queries by 2X to 100X.","PeriodicalId":20483,"journal":{"name":"Proceedings of the 2016 International Conference on Management of Data","volume":"17 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-06-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81799012","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 71