Proceedings of the 2016 International Conference on Management of Data最新文献

筛选
英文 中文
DUALSIM: Parallel Subgraph Enumeration in a Massive Graph on a Single Machine DUALSIM:单机上海量图的并行子图枚举
Proceedings of the 2016 International Conference on Management of Data Pub Date : 2016-06-14 DOI: 10.1145/2882903.2915209
Hyeonji Kim, Juneyoung Lee, S. Bhowmick, Wook-Shin Han, Jeong-Hoon Lee, Seongyun Ko, M. Jarrah
{"title":"DUALSIM: Parallel Subgraph Enumeration in a Massive Graph on a Single Machine","authors":"Hyeonji Kim, Juneyoung Lee, S. Bhowmick, Wook-Shin Han, Jeong-Hoon Lee, Seongyun Ko, M. Jarrah","doi":"10.1145/2882903.2915209","DOIUrl":"https://doi.org/10.1145/2882903.2915209","url":null,"abstract":"Subgraph enumeration is important for many applications such as subgraph frequencies, network motif discovery, graphlet kernel computation, and studying the evolution of social networks. Most earlier work on subgraph enumeration assumes that graphs are resident in memory, which results in serious scalability problems. Recently, efforts to enumerate all subgraphs in a large-scale graph have seemed to enjoy some success by partitioning the data graph and exploiting the distributed frameworks such as MapReduce and distributed graph engines. However, we notice that all existing distributed approaches have serious performance problems for subgraph enumeration due to the explosive number of partial results. In this paper, we design and implement a disk-based, single machine parallel subgraph enumeration solution called DualSim that can handle massive graphs without maintaining exponential numbers of partial results. Specifically, we propose a novel concept of the dual approach for subgraph enumeration. The dual approach swaps the roles of the data graph and the query graph. Specifically, instead of fixing the matching order in the query and then matching data vertices, it fixes the data vertices by fixing a set of disk pages and then finds all subgraph matchings in these pages. This enables us to significantly reduce the number of disk reads. We conduct extensive experiments with various real-world graphs to systematically demonstrate the superiority of DualSim over state-of-the-art distributed subgraph enumeration methods. DualSim outperforms the state-of-the-art methods by up to orders of magnitude, while they fail for many queries due to explosive intermediate results.","PeriodicalId":20483,"journal":{"name":"Proceedings of the 2016 International Conference on Management of Data","volume":"87 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-06-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82660389","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 66
Data Polygamy: The Many-Many Relationships among Urban Spatio-Temporal Data Sets 数据一夫多妻:城市时空数据集之间的多-多关系
Proceedings of the 2016 International Conference on Management of Data Pub Date : 2016-06-14 DOI: 10.1145/2882903.2915245
F. Chirigati, Harish Doraiswamy, T. Damoulas, J. Freire
{"title":"Data Polygamy: The Many-Many Relationships among Urban Spatio-Temporal Data Sets","authors":"F. Chirigati, Harish Doraiswamy, T. Damoulas, J. Freire","doi":"10.1145/2882903.2915245","DOIUrl":"https://doi.org/10.1145/2882903.2915245","url":null,"abstract":"The increasing ability to collect data from urban environments, coupled with a push towards openness by governments, has resulted in the availability of numerous spatio-temporal data sets covering diverse aspects of a city. Discovering relationships between these data sets can produce new insights by enabling domain experts to not only test but also generate hypotheses. However, discovering these relationships is difficult. First, a relationship between two data sets may occur only at certain locations and/or time periods. Second, the sheer number and size of the data sets, coupled with the diverse spatial and temporal scales at which the data is available, presents computational challenges on all fronts, from indexing and querying to analyzing them. Finally, it is non-trivial to differentiate between meaningful and spurious relationships. To address these challenges, we propose Data Polygamy, a scalable topology-based framework that allows users to query for statistically significant relationships between spatio-temporal data sets. We have performed an experimental evaluation using over 300 spatial-temporal urban data sets which shows that our approach is scalable and effective at identifying interesting relationships.","PeriodicalId":20483,"journal":{"name":"Proceedings of the 2016 International Conference on Management of Data","volume":"71 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-06-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80307507","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 64
Ambry: LinkedIn's Scalable Geo-Distributed Object Store Ambry: LinkedIn的可伸缩地理分布式对象存储
Proceedings of the 2016 International Conference on Management of Data Pub Date : 2016-06-14 DOI: 10.1145/2882903.2903738
S. Noghabi, Sriram Ganapathi Subramanian, Priyesh Narayanan, Sivabalan Narayanan, G. Holla, M. Zadeh, Tianwei Li, Indranil Gupta, R. Campbell
{"title":"Ambry: LinkedIn's Scalable Geo-Distributed Object Store","authors":"S. Noghabi, Sriram Ganapathi Subramanian, Priyesh Narayanan, Sivabalan Narayanan, G. Holla, M. Zadeh, Tianwei Li, Indranil Gupta, R. Campbell","doi":"10.1145/2882903.2903738","DOIUrl":"https://doi.org/10.1145/2882903.2903738","url":null,"abstract":"The infrastructure beneath a worldwide social network has to continually serve billions of variable-sized media objects such as photos, videos, and audio clips. These objects must be stored and served with low latency and high throughput by a system that is geo-distributed, highly scalable, and load-balanced. Existing file systems and object stores face several challenges when serving such large objects. We present Ambry, a production-quality system for storing large immutable data (called blobs). Ambry is designed in a decentralized way and leverages techniques such as logical blob grouping, asynchronous replication, rebalancing mechanisms, zero-cost failure detection, and OS caching. Ambry has been running in LinkedIn's production environment for the past 2 years, serving up to 10K requests per second across more than 400 million users. Our experimental evaluation reveals that Ambry offers high efficiency (utilizing up to 88% of the network bandwidth), low latency (less than 50 ms latency for a 1 MB object), and load balancing (improving imbalance of request rate among disks by 8x-10x).","PeriodicalId":20483,"journal":{"name":"Proceedings of the 2016 International Conference on Management of Data","volume":"11 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-06-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77796050","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 29
Estimating the Impact of Unknown Unknowns on Aggregate Query Results 估计未知未知数对聚合查询结果的影响
Proceedings of the 2016 International Conference on Management of Data Pub Date : 2016-06-14 DOI: 10.1145/2882903.2882909
Yeounoh Chung, Michael L. Mortensen, Carsten Binnig, Tim Kraska
{"title":"Estimating the Impact of Unknown Unknowns on Aggregate Query Results","authors":"Yeounoh Chung, Michael L. Mortensen, Carsten Binnig, Tim Kraska","doi":"10.1145/2882903.2882909","DOIUrl":"https://doi.org/10.1145/2882903.2882909","url":null,"abstract":"It is common practice for data scientists to acquire and integrate disparate data sources to achieve higher quality results. But even with a perfectly cleaned and merged data set, two fundamental questions remain: (1) is the integrated data set complete and (2) what is the impact of any unknown (i.e., unobserved) data on query results? In this work, we develop and analyze techniques to estimate the impact of the unknown data (a.k.a., unknown unknowns) on simple aggregate queries. The key idea is that the overlap between different data sources enables us to estimate the number and values of the missing data items. Our main techniques are parameter-free and do not assume prior knowledge about the distribution. Through a series of experiments, we show that estimating the impact of unknown unknowns is invaluable to better assess the results of aggregate queries over integrated data sources.","PeriodicalId":20483,"journal":{"name":"Proceedings of the 2016 International Conference on Management of Data","volume":"18 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-06-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80907424","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 5
Expressive Query Construction through Direct Manipulation of Nested Relational Results 通过直接操作嵌套关系结果构建表达性查询
Proceedings of the 2016 International Conference on Management of Data Pub Date : 2016-06-14 DOI: 10.1145/2882903.2915210
Eirik Bakke, David R Karger
{"title":"Expressive Query Construction through Direct Manipulation of Nested Relational Results","authors":"Eirik Bakke, David R Karger","doi":"10.1145/2882903.2915210","DOIUrl":"https://doi.org/10.1145/2882903.2915210","url":null,"abstract":"Despite extensive research on visual query systems, the standard way to interact with relational databases remains to be through SQL queries and tailored form interfaces. We consider three requirements to be essential to a successful alternative: (1) query specification through direct manipulation of results, (2) the ability to view and modify any part of the current query without departing from the direct manipulation interface, and (3) SQL-like expressiveness. This paper presents the first visual query system to meet all three requirements in a single design. By directly manipulating nested relational results, and using spreadsheet idioms such as formulas and filters, the user can express a relationally complete set of query operators plus calculation, aggregation, outer joins, sorting, and nesting, while always remaining able to track and modify the state of the complete query. Our prototype gives the user an experience of responsive, incremental query building while pushing all actual query processing to the database layer. We evaluate our system with formative and controlled user studies on 28 spreadsheet users; the controlled study shows our system significantly outperforming Microsoft Access on the System Usability Scale.","PeriodicalId":20483,"journal":{"name":"Proceedings of the 2016 International Conference on Management of Data","volume":"21 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-06-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78759289","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 37
Multi-Source Uncertain Entity Resolution at Yad Vashem: Transforming Holocaust Victim Reports into People 亚德瓦谢姆多来源不确定实体解决方案:将大屠杀受害者报告转化为人
Proceedings of the 2016 International Conference on Management of Data Pub Date : 2016-06-14 DOI: 10.1145/2882903.2903737
Tomer Sagi, A. Gal, Omer Barkol, Ruth Bergman, Alexander Avram
{"title":"Multi-Source Uncertain Entity Resolution at Yad Vashem: Transforming Holocaust Victim Reports into People","authors":"Tomer Sagi, A. Gal, Omer Barkol, Ruth Bergman, Alexander Avram","doi":"10.1145/2882903.2903737","DOIUrl":"https://doi.org/10.1145/2882903.2903737","url":null,"abstract":"In this work we describe an entity resolution project performed at Yad Vashem, the central repository of Holocaust-era information. The Yad Vashem dataset is unique with respect to classic entity resolution, by virtue of being both massively multi-source and by requiring multi-level entity resolution. With today's abundance of information sources, this project sets an example for multi-source resolution on a big-data scale. We discuss a set of requirements that led us to choose the MFIBlocks entity resolution algorithm in achieving the goals of the application. We also provide a machine learning approach, based upon decision trees to transform soft clusters into ranked clustering of records, representing possible entities. An extensive empirical evaluation demonstrates the unique properties of this dataset, highlighting the shortcomings of current methods and proposing avenues for future research in this realm.","PeriodicalId":20483,"journal":{"name":"Proceedings of the 2016 International Conference on Management of Data","volume":"11 7 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-06-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82882720","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 10
Augmented Sketch: Faster and More Accurate Stream Processing 增强草图:更快,更准确的流处理
Proceedings of the 2016 International Conference on Management of Data Pub Date : 2016-06-14 DOI: 10.1145/2882903.2882948
Pratanu Roy, Arijit Khan, G. Alonso
{"title":"Augmented Sketch: Faster and More Accurate Stream Processing","authors":"Pratanu Roy, Arijit Khan, G. Alonso","doi":"10.1145/2882903.2882948","DOIUrl":"https://doi.org/10.1145/2882903.2882948","url":null,"abstract":"Approximated algorithms are often used to estimate the frequency of items on high volume, fast data streams. The most common ones are variations of Count-Min sketch, which use sub-linear space for the count, but can produce errors in the counts of the most frequent items and can misclassify low-frequency items. In this paper, we improve the accuracy of sketch-based algorithms by increasing the frequency estimation accuracy of the most frequent items and reducing the possible misclassification of low-frequency items, while also improving the overall throughput. Our solution, called Augmented Sketch (ASketch), is based on a pre-filtering stage that dynamically identifies and aggregates the most frequent items. Items overflowing the pre-filtering stage are processed using a conventional sketch algorithm, thereby making the solution general and applicable in a wide range of contexts. The pre-filtering stage can be efficiently implemented with SIMD instructions on multi-core machines and can be further parallelized through pipeline parallelism where the filtering stage runs in one core and the sketch algorithm runs in another core.","PeriodicalId":20483,"journal":{"name":"Proceedings of the 2016 International Conference on Management of Data","volume":"11 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-06-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87883532","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 135
PrivateClean: Data Cleaning and Differential Privacy PrivateClean:数据清理和差异隐私
Proceedings of the 2016 International Conference on Management of Data Pub Date : 2016-06-14 DOI: 10.1145/2882903.2915248
S. Krishnan, Jiannan Wang, M. Franklin, Ken Goldberg, Tim Kraska
{"title":"PrivateClean: Data Cleaning and Differential Privacy","authors":"S. Krishnan, Jiannan Wang, M. Franklin, Ken Goldberg, Tim Kraska","doi":"10.1145/2882903.2915248","DOIUrl":"https://doi.org/10.1145/2882903.2915248","url":null,"abstract":"Recent advances in differential privacy make it possible to guarantee user privacy while preserving the main characteristics of the data. However, most differential privacy mechanisms assume that the underlying dataset is clean. This paper explores the link between data cleaning and differential privacy in a framework we call PrivateClean. PrivateClean includes a technique for creating private datasets of numerical and discrete-valued attributes, a formalism for privacy-preserving data cleaning, and techniques for answering sum, count, and avg queries after cleaning. We show: (1) how the degree of privacy affects subsequent aggregate query accuracy, (2) how privacy potentially amplifies certain types of errors in a dataset, and (3) how this analysis can be used to tune the degree of privacy. The key insight is to maintain a bipartite graph relating dirty values to clean values and use this graph to estimate biases due to the interaction between cleaning and privacy. We validate these results on four datasets with a variety of well-studied cleaning techniques including using functional dependencies, outlier filtering, and resolving inconsistent attributes.","PeriodicalId":20483,"journal":{"name":"Proceedings of the 2016 International Conference on Management of Data","volume":"32 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-06-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86123559","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 33
Efficient and Progressive Group Steiner Tree Search 高效进步的群斯坦纳树搜索
Proceedings of the 2016 International Conference on Management of Data Pub Date : 2016-06-14 DOI: 10.1145/2882903.2915217
Ronghua Li, Lu Qin, J. Yu, Rui Mao
{"title":"Efficient and Progressive Group Steiner Tree Search","authors":"Ronghua Li, Lu Qin, J. Yu, Rui Mao","doi":"10.1145/2882903.2915217","DOIUrl":"https://doi.org/10.1145/2882903.2915217","url":null,"abstract":"The Group Steiner Tree (GST) problem is a fundamental problem in database area that has been successfully applied to keyword search in relational databases and team search in social networks. The state-of-the-art algorithm for the GST problem is a parameterized dynamic programming (DP) algorithm, which finds the optimal tree in O(3kn+2k(n log n + m)) time, where k is the number of given groups, m and n are the number of the edges and nodes of the graph respectively. The major limitations of the parameterized DP algorithm are twofold: (i) it is intractable even for very small values of k (e.g., k=8) in large graphs due to its exponential complexity, and (ii) it cannot generate a solution until the algorithm has completed its entire execution. To overcome these limitations, we propose an efficient and progressive GST algorithm in this paper, called PrunedDP. It is based on newly-developed optimal-tree decomposition and conditional tree merging techniques. The proposed algorithm not only drastically reduces the search space of the parameterized DP algorithm, but it also produces progressively-refined feasible solutions during algorithm execution. To further speed up the PrunedDP algorithm, we propose a progressive A*-search algorithm, based on several carefully-designed lower-bounding techniques. We conduct extensive experiments to evaluate our algorithms on several large scale real-world graphs. The results show that our best algorithm is not only able to generate progressively-refined feasible solutions, but it also finds the optimal solution with at least two orders of magnitude acceleration over the state-of-the-art algorithm, using much less memory.","PeriodicalId":20483,"journal":{"name":"Proceedings of the 2016 International Conference on Management of Data","volume":"96 4","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-06-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91473943","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 47
Accelerating Relational Databases by Leveraging Remote Memory and RDMA 利用远程内存和RDMA加速关系数据库
Proceedings of the 2016 International Conference on Management of Data Pub Date : 2016-06-14 DOI: 10.1145/2882903.2882949
Feng Li, Sudipto Das, M. Syamala, Vivek R. Narasayya
{"title":"Accelerating Relational Databases by Leveraging Remote Memory and RDMA","authors":"Feng Li, Sudipto Das, M. Syamala, Vivek R. Narasayya","doi":"10.1145/2882903.2882949","DOIUrl":"https://doi.org/10.1145/2882903.2882949","url":null,"abstract":"Memory is a crucial resource in relational databases (RDBMSs). When there is insufficient memory, RDBMSs are forced to use slower media such as SSDs or HDDs, which can significantly degrade workload performance. Cloud database services are deployed in data centers where network adapters supporting remote direct memory access (RDMA) at low latency and high bandwidth are becoming prevalent. We study the novel problem of how a Symmetric Multi-Processing (SMP) RDBMS, whose memory demands exceed locally-available memory, can leverage available remote memory in the cluster accessed via RDMA to improve query performance. We expose available memory on remote servers using a lightweight file API that allows an SMP RDBMS to leverage the benefits of remote memory with modest changes. We identify and implement several novel scenarios to demonstrate these benefits, and address design challenges that are crucial for efficient implementation. We implemented the scenarios in Microsoft SQL Server engine and present the first end-to-end study to demonstrate benefits of remote memory for a variety of micro-benchmarks and industry-standard benchmarks. Compared to using disks when memory is insufficient, we improve the throughput and latency of queries with short reads and writes by 3X to 10X, while improving the latency of multiple TPC-H and TPC-DS queries by 2X to 100X.","PeriodicalId":20483,"journal":{"name":"Proceedings of the 2016 International Conference on Management of Data","volume":"17 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-06-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81799012","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 71
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
相关产品
×
本文献相关产品
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信