Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data: Latest Articles

Oracle Workload Intelligence

Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data Pub Date : 2015-05-27 DOI: 10.1145/2723372.2742791

Quoc Trung Tran, Konstantinos Morfonios, N. Polyzotis

Citations: 13

Influence Maximization in Near-Linear Time: A Martingale Approach

Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data Pub Date : 2015-05-27 DOI: 10.1145/2723372.2723734

Youze Tang, Yanchen Shi, Xiaokui Xiao

Abstract: Given a social network G and a positive integer k, the influence maximization problem asks for k nodes (in G) whose adoptions of a certain idea or product can trigger the largest expected number of follow-up adoptions by the remaining nodes. This problem has been extensively studied in the literature, and the state-of-the-art technique runs in $O((k+\ell)(n+m)\log n / \varepsilon^2)$ expected time and returns a $(1 - 1/e - \varepsilon)$-approximate solution with at least $1 - 1/n^{\ell}$ probability. This paper presents an influence maximization algorithm that provides the same worst-case guarantees as the state of the art, but offers significantly improved empirical efficiency. The core of our algorithm is a set of estimation techniques based on martingales, a classic statistical tool. Those techniques not only provide accurate results with small computation overheads, but also enable our algorithm to support a larger class of information diffusion models than existing methods do. We experimentally evaluate our algorithm against the state of the art under several popular diffusion models, using real social networks with up to 1.4 billion edges. Our experimental results show that the proposed algorithm consistently outperforms the state of the art in terms of computational efficiency, and is often orders of magnitude faster.
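The abstract states the problem but not the algorithm, so the following is only a minimal sketch of one well-known sampling idea for influence maximization (reverse-reachable sets plus greedy maximum coverage under the independent cascade model). It is not the paper's martingale-based method. The toy graph, the uniform edge probability `p`, and the function names `sample_rr_set` and `greedy_seeds` are all assumptions made for illustration.

```python
import random

def sample_rr_set(in_neighbors, nodes, p, rng):
    """Sample one reverse-reachable (RR) set under the independent cascade model."""
    root = rng.choice(nodes)
    rr, frontier = {root}, [root]
    while frontier:
        v = frontier.pop()
        for u in in_neighbors.get(v, ()):
            # Each incoming edge (u -> v) is treated as live with probability p.
            if u not in rr and rng.random() < p:
                rr.add(u)
                frontier.append(u)
    return rr

def greedy_seeds(rr_sets, k):
    """Greedy maximum coverage: pick up to k nodes that cover the most RR sets."""
    uncovered = [set(s) for s in rr_sets]
    seeds = []
    for _ in range(k):
        counts = {}
        for s in uncovered:
            for u in s:
                counts[u] = counts.get(u, 0) + 1
        if not counts:
            break
        # Ties are broken alphabetically so the result is reproducible.
        best = max(sorted(counts), key=counts.get)
        seeds.append(best)
        uncovered = [s for s in uncovered if best not in s]
    return seeds

# Hypothetical toy graph: edges point from influencer to follower.
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d"), ("e", "d")]
nodes = sorted({u for edge in edges for u in edge})
in_neighbors = {}
for u, v in edges:
    in_neighbors.setdefault(v, []).append(u)

rng = random.Random(0)
# p = 1.0 makes every edge live, so only the choice of root is random.
rr_sets = [sample_rr_set(in_neighbors, nodes, p=1.0, rng=rng) for _ in range(200)]
seeds = greedy_seeds(rr_sets, k=2)
```

In this toy graph node `a` reaches every node except `e`, so the greedy step picks `a` first and then `e`. How many RR sets to sample for a given accuracy is exactly the kind of question the abstract's estimation techniques address, and this sketch makes no attempt to answer it.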

Citations: 674

Twitter Heron: Stream Processing at Scale

Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data Pub Date : 2015-05-27 DOI: 10.1145/2723372.2742788

Sanjeev Kulkarni, Nikunj Bhagat, Maosong Fu, Vikas Kedigehalli, Christopher Kellogg, Sailesh Mittal, J. Patel, K. Ramasamy, Siddarth Taneja

Citations: 576

Feral Concurrency Control: An Empirical Investigation of Modern Application Integrity

Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data Pub Date : 2015-05-27 DOI: 10.1145/2723372.2737784

Peter D. Bailis, A. Fekete, M. Franklin, A. Ghodsi, J. Hellerstein, I. Stoica

Abstract: The rise of data-intensive "Web 2.0" Internet services has led to a range of popular new programming frameworks that collectively embody the latest incarnation of the vision of Object-Relational Mapping (ORM) systems, albeit at unprecedented scale. In this work, we empirically investigate modern ORM-backed applications' use and disuse of database concurrency control mechanisms. Specifically, we focus our study on the common use of feral, or application-level, mechanisms for maintaining database integrity, which, across a range of ORM systems, often take the form of declarative correctness criteria, or invariants. We quantitatively analyze the use of these mechanisms in a range of open source applications written using the Ruby on Rails ORM and find that feral invariants are the most popular means of ensuring integrity (and, by usage, are over 37 times more popular than transactions). We evaluate which of these feral invariants actually ensure integrity (by usage, up to 86.9%) and which, due to concurrency errors and lack of database support, may lead to data corruption (the remainder), which we experimentally quantify. In light of these findings, we present recommendations for database system designers for better supporting these modern ORM programming patterns, thus eliminating their adverse effects on application integrity.
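To make the notion of a feral invariant concrete, here is a small sketch of an application-level uniqueness check that races with itself. It uses Python's standard `sqlite3` module rather than Ruby on Rails, and it simulates the interleaving of two requests sequentially on one connection instead of running truly concurrent transactions. The table names, the `email` column, and the helper names are hypothetical.

```python
import sqlite3

def feral_steps(conn, table, email):
    """Split an application-level uniqueness validation into check and insert."""
    def check():
        query = f"SELECT COUNT(*) FROM {table} WHERE email = ?"
        return conn.execute(query, (email,)).fetchone()[0] == 0

    def insert():
        conn.execute(f"INSERT INTO {table} (email) VALUES (?)", (email,))

    return check, insert

def run_interleaved(conn, table, email):
    """Two requests both validate before either one writes."""
    check_a, insert_a = feral_steps(conn, table, email)
    check_b, insert_b = feral_steps(conn, table, email)
    ok_a, ok_b = check_a(), check_b()   # both see "no such row yet"
    rejected = False
    for ok, insert in ((ok_a, insert_a), (ok_b, insert_b)):
        if ok:
            try:
                insert()
            except sqlite3.IntegrityError:
                rejected = True          # the database refused the duplicate
    count = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    return count, rejected

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users_feral (email TEXT)")       # invariant only in app code
conn.execute("CREATE TABLE users_db (email TEXT UNIQUE)")   # invariant enforced by the DB

feral_result = run_interleaved(conn, "users_feral", "x@example.com")
db_result = run_interleaved(conn, "users_db", "x@example.com")
```

With only the application-level check, both inserts succeed and the table ends up with two rows for the same email. With the `UNIQUE` constraint, the second insert raises `sqlite3.IntegrityError` and one row remains.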

Citations: 72

Bayesian Differential Privacy on Correlated Data

Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data Pub Date : 2015-05-27 DOI: 10.1145/2723372.2747643

Bin Yang, Issei Sato, Hiroshi Nakagawa

Abstract: Differential privacy provides a rigorous standard for evaluating the privacy of perturbation algorithms. It has widely been regarded that differential privacy is a universal definition that deals with both independent and correlated data and a differentially private algorithm can protect privacy against arbitrary adversaries. However, recent research indicates that differential privacy may not guarantee privacy against arbitrary adversaries if the data are correlated. In this paper, we focus on the private perturbation algorithms on correlated data. We investigate the following three problems: (1) the influence of data correlations on privacy; (2) the influence of adversary prior knowledge on privacy; and (3) a general perturbation algorithm that is private for prior knowledge of any subset of tuples in the data when the data are correlated. We propose a Pufferfish definition of privacy, called Bayesian differential privacy, by which the privacy level of a probabilistic perturbation algorithm can be evaluated even when the data are correlated and when the prior knowledge is incomplete. We present a Gaussian correlation model to accurately describe the structure of data correlations and analyze the Bayesian differential privacy of the perturbation algorithm on the basis of this model. Our results show that privacy is poorest for an adversary who has the least prior knowledge. We further extend this model to a more general one that considers uncertain prior knowledge.
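For readers unfamiliar with perturbation algorithms, the sketch below shows the standard Laplace mechanism for a counting query, which is the usual baseline for ordinary differential privacy on independent tuples. It does not implement Bayesian differential privacy or the Gaussian correlation model from the abstract. The function names and the example values (`true_count = 100`, `epsilon = 0.5`) are assumptions for illustration.

```python
import random

def laplace_scale(sensitivity, epsilon):
    """Noise scale b = sensitivity / epsilon for the Laplace mechanism."""
    return sensitivity / epsilon

def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace distribution with location 0 and that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_count(true_count, epsilon, rng, sensitivity=1.0):
    """Return the count perturbed with Laplace noise (assumes independent tuples)."""
    return true_count + laplace_noise(laplace_scale(sensitivity, epsilon), rng)

rng = random.Random(42)
samples = [private_count(100, epsilon=0.5, rng=rng) for _ in range(20000)]
mean_estimate = sum(samples) / len(samples)   # the noise is zero-mean, so this stays near 100
```

The abstract's concern is that when tuples are correlated, the sensitivity of 1 assumed here can understate how much one individual's data influences the output, so this calibration may not protect against every adversary.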

Citations: 150

Purity: Building Fast, Highly-Available Enterprise Flash Storage from Commodity Components

Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data Pub Date : 2015-05-27 DOI: 10.1145/2723372.2742798

John Colgrove, John D. Davis, John Hayes, E. L. Miller, C. Sandvig, R. Sears, Ariel Tamches, Neil Vachharajani, Feng Wang

Abstract: Although flash storage has largely replaced hard disks in consumer class devices, enterprise workloads pose unique challenges that have slowed adoption of flash in "performance tier" storage appliances. In this paper, we describe Purity, the foundation of Pure Storage's Flash Arrays, the first all-flash enterprise storage system to support compression, deduplication, and high-availability. Purity borrows techniques from modern database and key-value storage architectures, and introduces novel storage primitives that have wide applicability to data management systems. For instance, all writes in Purity are monotonic, and deletions are handled using an atomic predicate-based tuple elision primitive. Purity's redundancy mechanisms are optimized for SSD failure modes and performance characteristics, allowing for fast recovery from component failures and lower space overhead than the best hard disk systems. We built deduplication and data compression schemes atop these primitives. Flash changes storage capacity/performance tradeoffs: unlike disk-based systems, flash deployments are rarely performance bound. A single Purity appliance can provide over 7 GiB/s of throughput on 32 KiB random I/Os, even through multiple device failures, and while providing asynchronous off-site replication. Typical installations have 99.9% latencies under 1 ms, and production arrays average 5.4x data reduction and 99.999% availability. Purity takes advantage of storage performance increasing more rapidly than computational performance to build a simpler (with respect to engineering, installation, and management) scale-up storage appliance that supports hundreds of terabytes of highly-available, high-performance storage. The resulting performance and capacity supports many customer deployments of multiple applications, including scale-out and parallel systems, such as MongoDB and Oracle RAC, on a single Purity appliance.

Citations: 62

Rethinking Data-Intensive Science Using Scalable Analytics Systems

Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data Pub Date : 2015-05-27 DOI: 10.1145/2723372.2742787

Frank A. Nothaft, Matt Massie, Timothy Danford, Zhao Zhang, Uri Laserson, Carl Yeksigian, Jey Kottalam, Arun Ahuja, Jeff Hammerbacher, M. Linderman, M. Franklin, A. Joseph, D. Patterson

Abstract: "Next generation" data acquisition technologies are allowing scientists to collect exponentially more data at a lower cost. These trends are broadly impacting many scientific fields, including genomics, astronomy, and neuroscience. We can attack the problem caused by exponential data growth by applying horizontally scalable techniques from current analytics systems to accelerate scientific processing pipelines. In this paper, we describe ADAM, an example genomics pipeline that leverages the open-source Apache Spark and Parquet systems to achieve a 28x speedup over current genomics pipelines, while reducing cost by 63%. From building this system, we were able to distill a set of techniques for implementing scientific analyses efficiently using commodity "big data" systems. To demonstrate the generality of our architecture, we then implement a scalable astronomy image processing system which achieves a 2.8x to 8.9x improvement over the state-of-the-art MPI-based system.

Citations: 99

Location-Aware Pub/Sub System: When Continuous Moving Queries Meet Dynamic Event Streams

Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data Pub Date : 2015-05-27 DOI: 10.1145/2723372.2746481

Long Guo, Dongxiang Zhang, Guoliang Li, K. Tan, Z. Bao

Abstract: In this paper, we propose a new location-aware pub/sub system, Elaps, that continuously monitors moving users subscribing to dynamic event streams from social media and E-commerce applications. Users are notified instantly when there is a matching event nearby. To the best of our knowledge, Elaps is the first to take into account continuous moving queries against dynamic event streams. Like existing works on continuous moving query processing, Elaps employs the concept of safe region to reduce communication overhead. However, unlike existing works which assume data from publishers are static, updates to safe regions may be triggered by newly arrived events. In Elaps, we develop a concept called impact region that allows us to identify whether a safe region is affected by newly arrived events. Moreover, we propose a novel cost model to optimize the safe region size to keep the communication overhead low. Based on the cost model, we design two incremental methods, iGM and idGM, for safe region construction. In addition, Elaps uses boolean expressions, which are more expressive than keywords, to model user intent, and we propose a novel index, BEQ-Tree, to handle spatial boolean expression matching. In our experiments, we use geo-tweets from Twitter and venues from Foursquare to simulate publishers, and boolean expressions generated from the AOL search log to represent users' intentions. We test user movement in both synthetic trajectories and real taxi trajectories. The results show that Elaps can significantly reduce the communication overhead and disseminate events to users in real time.
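The idea of spatial boolean expression matching can be illustrated with a brute-force sketch: a subscription carries a boolean expression in disjunctive normal form plus a notification radius, and an event matches when it is close enough and satisfies at least one conjunction. This is not the BEQ-Tree index and it ignores safe regions and impact regions entirely. The `Event` and `Subscription` classes, the supported operators, and the planar Euclidean distance are assumptions for illustration.

```python
import math
from dataclasses import dataclass

@dataclass
class Event:
    attrs: dict   # attribute name -> value
    x: float
    y: float

@dataclass
class Subscription:
    dnf: list     # list of conjunctions; each is a list of (attr, op, value)
    x: float      # current user position
    y: float
    radius: float

OPS = {
    "==": lambda a, b: a == b,
    "<=": lambda a, b: a <= b,
    ">=": lambda a, b: a >= b,
}

def predicate_holds(attrs, attr, op, value):
    """A predicate on a missing attribute is treated as false."""
    return attr in attrs and OPS[op](attrs[attr], value)

def matches(sub, event):
    """True if the event is within the radius and satisfies some conjunction."""
    near = math.hypot(sub.x - event.x, sub.y - event.y) <= sub.radius
    return near and any(
        all(predicate_holds(event.attrs, *pred) for pred in conj)
        for conj in sub.dnf
    )

sub = Subscription(
    dnf=[[("category", "==", "coffee"), ("price", "<=", 4)]],
    x=0.0, y=0.0, radius=5.0,
)
nearby_deal = Event({"category": "coffee", "price": 3}, x=3.0, y=4.0)
far_deal = Event({"category": "coffee", "price": 3}, x=6.0, y=0.0)
wrong_topic = Event({"category": "tea", "price": 2}, x=1.0, y=1.0)
```

Checking every subscription against every event like this costs time proportional to their product, which is the kind of overhead an index over spatial boolean expressions is meant to avoid.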

Citations: 60

Mining Quality Phrases from Massive Text Corpora

Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data Pub Date : 2015-05-27 DOI: 10.1145/2723372.2751523

Jialu Liu, Jingbo Shang, Chi Wang, Xiang Ren, Jiawei Han

Citations: 191

LASH: Large-Scale Sequence Mining with Hierarchies

Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data Pub Date : 2015-05-27 DOI: 10.1145/2723372.2723724

Kaustubh Beedkar, Rainer Gemulla

Abstract: We propose LASH, a scalable, distributed algorithm for mining sequential patterns in the presence of hierarchies. LASH takes as input a collection of sequences, each composed of items from some application-specific vocabulary. In contrast to traditional approaches to sequence mining, the items in the vocabulary are arranged in a hierarchy: both input sequences and sequential patterns may consist of items from different levels of the hierarchy. Such hierarchies naturally occur in a number of applications including mining natural-language text, customer transactions, error logs, or event sequences. LASH is the first parallel algorithm for mining frequent sequences with hierarchies; it is designed to scale to very large datasets. At its heart, LASH partitions the data using a novel, hierarchy-aware variant of item-based partitioning and subsequently mines each partition independently and in parallel using a customized mining algorithm called pivot sequence miner. LASH is amenable to a MapReduce implementation; we propose effective and efficient algorithms for both the construction and the actual mining of partitions. Our experimental study on large real-world datasets suggests good scalability and run-time efficiency.
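What it means for a pattern to mix levels of a hierarchy can be shown with a naive, single-machine sketch. It counts only contiguous two-item patterns, generalizing each item to itself or any of its ancestors, and keeps patterns whose support reaches a threshold. This is a simplifying assumption made here, not LASH's partitioning scheme or its pivot sequence miner. The `parent` map, the toy sequences, and the function names are hypothetical.

```python
from collections import Counter
from itertools import product

def ancestors_and_self(item, parent):
    """Return the item followed by its ancestors, walking up the hierarchy."""
    chain = [item]
    while chain[-1] in parent:
        chain.append(parent[chain[-1]])
    return chain

def frequent_generalized_bigrams(sequences, parent, min_support):
    """Count, per sequence, every generalization of each adjacent item pair."""
    support = Counter()
    for seq in sequences:
        seen = set()   # each pattern counts at most once per input sequence
        for a, b in zip(seq, seq[1:]):
            seen.update(product(ancestors_and_self(a, parent),
                                ancestors_and_self(b, parent)))
        support.update(seen)
    return {pattern: n for pattern, n in support.items() if n >= min_support}

# Hypothetical hierarchy: child -> parent.
parent = {"Obama": "PERSON", "Merkel": "PERSON", "visits": "VERB", "meets": "VERB"}
sequences = [["Obama", "visits", "Berlin"], ["Merkel", "meets", "Obama"]]
result = frequent_generalized_bigrams(sequences, parent, min_support=2)
```

No concrete word pair occurs in both toy sequences, yet the generalized pattern `("PERSON", "VERB")` has support 2. Enumerating all generalizations this way grows quickly with hierarchy depth and pattern length, which is why a distributed, hierarchy-aware approach matters at scale.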

Citations: 17