2014 IEEE 30th International Conference on Data Engineering — Latest Publications

Rethinking main memory OLTP recovery
2014 IEEE 30th International Conference on Data Engineering Pub Date : 2014-05-19 DOI: 10.1109/ICDE.2014.6816685
Nirmesh Malviya, Ariel Weisberg, S. Madden, M. Stonebraker
Abstract: Fine-grained, record-oriented write-ahead logging, as exemplified by systems like ARIES, has been the gold standard for relational database recovery. In this paper, we show that in modern high-throughput transaction processing systems, this is no longer the optimal way to recover a database system. In particular, as transaction throughputs get higher, ARIES-style logging starts to represent a non-trivial fraction of the overall transaction execution time. We propose a lighter-weight, coarse-grained command logging technique which only records the transactions that were executed on the database. It then does recovery by starting from a transactionally consistent checkpoint and replaying the commands in the log as if they were new transactions. By avoiding the overhead of fine-grained logging of before and after images (both the CPU cost and the substantial associated I/O), command logging can yield significantly higher throughput at run-time. Recovery times for command logging are higher compared to an ARIES-style physiological logging approach, but with the advent of high-availability techniques that can mask the outage of a recovering node, recovery speed has become secondary in importance to run-time performance for most applications. We evaluated our approach on an implementation of TPC-C in a main-memory database system (VoltDB), and found that command logging can offer 1.5× higher throughput than a main-memory-optimized implementation of ARIES-style physiological logging.
Citations: 135
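The contrast between record-level logging and command logging can be made concrete with a small sketch. The class below is a hypothetical, in-memory illustration of the idea described in the abstract (log only the transaction commands, recover by replaying them from a transactionally consistent checkpoint); it is not VoltDB's actual implementation, and all names are invented for illustration.

```python
import copy

class CommandLoggedDB:
    """Sketch of coarse-grained command logging: instead of logging
    per-record before/after images (ARIES-style), only the transaction
    commands themselves are appended to the log."""

    def __init__(self):
        self.state = {}        # in-memory table: key -> value
        self.log = []          # command log: (proc_name, args) per transaction
        self.checkpoint = {}   # last transactionally consistent snapshot

    def execute(self, proc_name, *args):
        self._apply(proc_name, args)
        self.log.append((proc_name, args))   # one small log record per txn

    def _apply(self, proc_name, args):
        if proc_name == "put":
            key, value = args
            self.state[key] = value
        elif proc_name == "incr":
            key, delta = args
            self.state[key] = self.state.get(key, 0) + delta

    def take_checkpoint(self):
        self.checkpoint = copy.deepcopy(self.state)
        self.log = []          # records before the checkpoint can be truncated

    def recover(self):
        # Start from the checkpoint and replay logged commands
        # as if they were new transactions.
        self.state = copy.deepcopy(self.checkpoint)
        for proc_name, args in self.log:
            self._apply(proc_name, args)
```

Note that each log record is tiny (a procedure name and its arguments) regardless of how many records the transaction touches, which is the source of the run-time savings; the price is that recovery must re-execute the transaction logic.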
Contract & Expand: I/O Efficient SCCs Computing
2014 IEEE 30th International Conference on Data Engineering Pub Date : 2014-05-19 DOI: 10.1109/ICDE.2014.6816652
Zhiwei Zhang, Lu Qin, J. Yu
Abstract: As an important branch of big data processing, big graph processing has become increasingly popular in recent years. Strongly connected component (SCC) computation is a fundamental graph operation on directed graphs, where an SCC is a maximal subgraph S of a directed graph G in which every pair of nodes is reachable from each other in S. By contracting each SCC into a node, a large general directed graph can be represented by a small directed acyclic graph (DAG). In the literature, there are I/O efficient semi-external algorithms to compute all SCCs of a graph G, assuming that all nodes of G can fit in main memory. However, many real graphs are so large that even their nodes cannot reside entirely in main memory. In this paper, we study new I/O efficient external algorithms to find all SCCs for a directed graph G whose nodes cannot fit entirely in main memory. To overcome the deficiencies of the existing external contraction-based approach, which usually cannot stop in a finite number of iterations, and the external DFS-based approach, which generates a large number of random I/Os, we explore a new contraction-expansion based approach. In the graph contraction phase, instead of contracting the whole graph as the contraction-based approach does, we contract only selected nodes of the graph. The contraction phase stops when all nodes of the graph fit in main memory, at which point the semi-external algorithm can be used for SCC computation. In the graph expansion phase, as the graph is expanded in the reverse order of contraction, the SCCs of all nodes in the graph are computed. Both the contraction phase and the expansion phase use only I/O efficient sequential scans and external sorts of nodes/edges in the graph. Our algorithm leverages the efficiency of the semi-external SCC computation algorithm and usually stops in a small number of iterations. We further optimize our approach by reducing the number of nodes and edges of the contracted graph in each iteration. We conduct extensive experimental studies using both real and synthetic web-scale graphs to confirm the I/O efficiency of our approaches.
Citations: 5
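The SCC-and-contract primitive the paper builds on can be illustrated in memory. The sketch below uses Kosaraju's classical two-pass algorithm to find all SCCs of a small directed graph; it is a fully in-memory stand-in for the operation that Contract & Expand computes I/O-efficiently when even the node set does not fit in memory, not the paper's external algorithm.

```python
def sccs(graph):
    """Kosaraju's two-pass algorithm: return the strongly connected
    components of a directed graph given as {node: [successors]}."""
    nodes = set(graph) | {v for succs in graph.values() for v in succs}

    # Pass 1: record nodes in order of DFS finish time (iterative DFS).
    visited, order = set(), []
    for root in nodes:
        if root in visited:
            continue
        visited.add(root)
        stack = [(root, iter(graph.get(root, ())))]
        while stack:
            node, it = stack[-1]
            for nxt in it:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(graph.get(nxt, ()))))
                    break
            else:                      # all successors explored
                order.append(node)
                stack.pop()

    # Pass 2: DFS on the reversed graph, in reverse finish order;
    # each tree found is one SCC.
    rev = {}
    for u, succs in graph.items():
        for v in succs:
            rev.setdefault(v, []).append(u)
    assigned, components = set(), []
    for root in reversed(order):
        if root in assigned:
            continue
        assigned.add(root)
        comp, stack = [], [root]
        while stack:
            u = stack.pop()
            comp.append(u)
            for v in rev.get(u, ()):
                if v not in assigned:
                    assigned.add(v)
                    stack.append(v)
        components.append(comp)
    return components
```

Contracting each returned component into a single node then yields the DAG representation the abstract mentions.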
Query optimization of distributed pattern matching
2014 IEEE 30th International Conference on Data Engineering Pub Date : 2014-05-19 DOI: 10.1109/ICDE.2014.6816640
Jiewen Huang, K. Venkatraman, D. Abadi
Abstract: Greedy algorithms for subgraph pattern matching operations are often sufficient when the graph data set can be held in memory on a single machine. However, as graph data sets increasingly expand and require external storage and partitioning across a cluster of machines, more sophisticated query optimization techniques become critical to avoid explosions in query latency. In this paper, we introduce several query optimization techniques for distributed graph pattern matching. These techniques include (1) a System-R style dynamic programming-based optimization algorithm that considers both linear and bushy plans, (2) a cycle detection-based algorithm that leverages cycles to reduce intermediate result set sizes, and (3) a computation reusing technique that eliminates redundant query execution and data transfer over the network. Experimental results show that these algorithms can lead to an order of magnitude improvement in query performance.
Citations: 38
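A System-R style dynamic program over relation subsets, including bushy plans, can be sketched as follows. This is a toy illustration under an assumed cost model (minimize the total cardinality of intermediate results, estimated from per-pair join selectivities), not the paper's optimizer; all names are invented.

```python
from itertools import combinations

def best_plan(sizes, selectivity):
    """System-R style DP over relation subsets, considering bushy plans:
    the best plan for a set of relations is the cheapest way to combine
    the best plans for two disjoint subsets.
    sizes: {rel: cardinality}; selectivity: {(a, b): factor}, symmetric,
    defaulting to 1.0 (cross product) for unlisted pairs."""
    rels = sorted(sizes)

    def sel(a, b):
        return selectivity.get((a, b), selectivity.get((b, a), 1.0))

    def card(subset):
        # Estimated cardinality: product of sizes and pairwise selectivities.
        c = 1.0
        for r in subset:
            c *= sizes[r]
        for a, b in combinations(sorted(subset), 2):
            c *= sel(a, b)
        return c

    best = {frozenset([r]): (0.0, r) for r in rels}   # base scans cost nothing here
    for n in range(2, len(rels) + 1):
        for combo in combinations(rels, n):
            s = frozenset(combo)
            out = card(s)
            for k in range(1, n // 2 + 1):            # all splits, up to symmetry
                for left in combinations(combo, k):
                    l = frozenset(left)
                    r = s - l
                    cost = best[l][0] + best[r][0] + out
                    if s not in best or cost < best[s][0]:
                        best[s] = (cost, (best[l][1], best[r][1]))
    return best[frozenset(rels)]
```

Because every subset keeps only its cheapest plan, the search is exponential in the number of relations but far smaller than enumerating all plan trees.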
Leveraging metadata for identifying local, robust multi-variate temporal (RMT) features
2014 IEEE 30th International Conference on Data Engineering Pub Date : 2014-05-19 DOI: 10.1109/ICDE.2014.6816667
Xiaolan Wang, K. Candan, M. Sapino
Abstract: Many applications generate and/or consume multi-variate temporal data, yet experts often lack the means to adequately and systematically search for and interpret multi-variate observations. In this paper, we first observe that multi-variate time series often carry localized multi-variate temporal features that are robust against noise. We then argue that these multi-variate temporal features can be extracted by simultaneously considering, at multiple scales, temporal characteristics of the time series along with external knowledge, including variate relationships, known a priori. Relying on these observations, we develop algorithms to detect robust multi-variate temporal (RMT) features which can be indexed for efficient and accurate retrieval and can be used for supporting analysis tasks, such as classification. Experiments confirm that the proposed RMT algorithm is highly effective and efficient in identifying robust multi-scale temporal features of multi-variate time series.
Citations: 11
DBDesigner: A customizable physical design tool for Vertica Analytic Database
2014 IEEE 30th International Conference on Data Engineering Pub Date : 2014-05-19 DOI: 10.1109/ICDE.2014.6816725
R. Varadarajan, V. Bharathan, A. Cary, J. Dave, Sreenath Bodagala
Abstract: In this paper, we present Vertica's customizable physical design tool, called the DBDesigner (DBD), that produces designs optimized for various scenarios and applications. For a given workload and space budget, DBD automatically recommends a physical design that optimizes query performance, storage footprint, fault tolerance and recovery to meet different customer requirements. Vertica is a distributed, massively parallel columnar database that physically organizes data into projections. Projections are attribute subsets from one or more tables, with tuples sorted by one or more attributes, that are replicated or segmented (distributed) on cluster nodes. The key challenges involved in projection design are picking appropriate column sets, sort orders, cluster data distributions and column encodings. To achieve the desired trade-off between query performance and storage footprint, DBD operates under three different design policies: (a) load-optimized, (b) query-optimized or (c) balanced. These policies indirectly control the number of projections proposed and queries optimized to achieve the desired balance. To cater to query workloads that evolve over time, DBD also operates in comprehensive and incremental design modes. In addition, DBD lets users override specific features of projection design based on their intimate knowledge about the data and query workloads. We present the complete physical design algorithm, describing in detail how projection candidates are efficiently explored and evaluated using the optimizer's cost and benefit model. Our experimental results show that DBD produces good physical designs that satisfy a variety of customer use cases.
Citations: 20
SLICE: Reviving regions-based pruning for reverse k nearest neighbors queries
2014 IEEE 30th International Conference on Data Engineering Pub Date : 2014-05-19 DOI: 10.1109/ICDE.2014.6816698
Shiyu Yang, M. A. Cheema, Xuemin Lin, Ying Zhang
Abstract: Given a set of facilities and a set of users, a reverse k nearest neighbors (RkNN) query q returns every user for which the query facility is one of the k closest facilities. Due to its importance, the RkNN query has received significant research attention in the past few years. Almost all of the existing techniques adopt a pruning-and-verification framework. Regions-based pruning and half-space pruning are the two most notable pruning strategies. The half-space based approach prunes a larger area and is generally believed to be superior. Influenced by this perception, almost all existing RkNN algorithms utilize and improve the half-space pruning strategy. We observe the weaknesses and strengths of both strategies and discover that regions-based pruning has certain strengths that have not been exploited in the past. Motivated by this, we present a new RkNN algorithm called SLICE that utilizes the strengths of regions-based pruning and overcomes its limitations. Our extensive experimental study on synthetic and real data sets demonstrates that SLICE is significantly more efficient than the existing algorithms. We also provide a detailed theoretical analysis of various aspects of our algorithm, such as the I/O cost, the unpruned area, and the cost of its verification phase. The experimental study validates our theoretical analysis.
Citations: 37
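The RkNN definition itself has a straightforward brute-force implementation, which is useful as a correctness oracle when testing pruning-based algorithms such as SLICE. The sketch below assumes 2D points and that the query facility is passed separately from the remaining facilities; it performs no pruning at all.

```python
import math

def rknn(query, facilities, users, k):
    """Brute-force reverse k-nearest-neighbors: return every user for whom
    `query` is among its k closest facilities.  `facilities` holds the
    facilities other than the query facility; all points are 2D tuples.
    O(|users| * |facilities|) distance computations, no pruning."""
    def dist(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])

    result = []
    for u in users:
        d_query = dist(u, query)
        # Count facilities strictly closer to u than the query facility is.
        closer = sum(1 for f in facilities if dist(u, f) < d_query)
        if closer < k:   # fewer than k facilities beat the query: u qualifies
            result.append(u)
    return result
```

The point of SLICE and its predecessors is precisely to avoid this quadratic work by pruning regions of space that cannot contain any answer.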
Keyword-based correlated network computation over large social media
2014 IEEE 30th International Conference on Data Engineering Pub Date : 2014-05-19 DOI: 10.1109/ICDE.2014.6816657
Jianxin Li, Chengfei Liu, Md. Saiful Islam
Abstract: Recent years have witnessed an unprecedented proliferation of social media, e.g., millions of blog posts, micro-blog posts, and social networks on the Internet. This kind of social media data can be modeled as a large graph where nodes represent entities and edges represent relationships between entities of the social media. Discovering keyword-based correlated networks in these large graphs is an important primitive in data analysis, allowing users to focus on the information they care about in the large graph. In this paper, we propose and define the problem of keyword-based correlated network computation over a massive graph. To do this, we first present a novel tree data structure that maintains only the shortest path between any two graph nodes, by which the massive graph can be equivalently transformed into a tree for addressing our proposed problem. After that, we design efficient algorithms to build the transformed tree data structure from a graph offline and to compute the γ-bounded keyword matched subgraphs based on the pre-built tree data structure on the fly. To further improve efficiency, we propose weighted shingle-based approximation approaches to measure the correlation among a large number of γ-bounded keyword matched subgraphs. Finally, we develop a merge-sort based approach to efficiently generate the correlated networks. Our extensive experiments demonstrate the efficiency of our algorithms in reducing time and space cost. The experimental results also justify the effectiveness of our method in discovering correlated networks from three real datasets.
Citations: 16
Finding common ground among experts' opinions on data clustering: With applications in malware analysis
2014 IEEE 30th International Conference on Data Engineering Pub Date : 2014-05-19 DOI: 10.1109/ICDE.2014.6816636
Guanhua Yan
Abstract: Data clustering is a basic technique for knowledge discovery and data mining. As the volume of data grows significantly, data clustering becomes computationally prohibitive and resource demanding, and sometimes it is necessary to outsource these tasks to third-party experts who specialize in data clustering. The goal of this work is to develop techniques that find common ground among experts' opinions on data clustering, which may be biased due to the features or algorithms used in clustering. Our work differs from the large body of existing approaches to consensus clustering, as we do not require all data objects to be grouped into clusters. Rather, our work is motivated by real-world applications that demand high confidence in how data objects - if they are selected - are grouped together. We formulate the problem rigorously and show that it is NP-complete. We further develop a lightweight technique based on finding a maximum independent set in a 3-uniform hypergraph to select data objects that do not form conflicts among experts' opinions. We apply our proposed method to a real-world malware dataset with hundreds of thousands of instances to find malware clusters based on how multiple major AV (Anti-Virus) products classify these samples. Our work offers a new direction for consensus clustering by striking a balance between clustering quality and the number of data objects chosen to be clustered.
Citations: 5
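Since maximum independent set in a 3-uniform hypergraph is NP-hard, a simple greedy heuristic conveys the flavor of the selection step: repeatedly discard the object involved in the most unresolved conflict triples until no conflict survives intact. This is a hypothetical illustration of the hypergraph formulation, not the paper's own lightweight technique, and the function and argument names are invented.

```python
def conflict_free_objects(objects, conflicts):
    """Greedy heuristic for independent set in a conflict hypergraph.
    Each conflict is a set of objects (a hyperedge, e.g. a triple of
    objects on which the experts' clusterings disagree irreconcilably).
    Repeatedly drop the object covering the most remaining conflicts until
    every conflict has lost a member; the survivors are an independent set."""
    remaining = [frozenset(c) for c in conflicts]
    kept = set(objects)
    while remaining:
        # Count how many live conflicts each object participates in.
        counts = {}
        for conflict in remaining:
            for obj in conflict:
                counts[obj] = counts.get(obj, 0) + 1
        # Deterministic tie-break on the string form of the object.
        victim = max(counts, key=lambda o: (counts[o], str(o)))
        kept.discard(victim)
        remaining = [c for c in remaining if victim not in c]
    return kept
```

The trade-off the abstract describes is visible here: dropping objects buys conflict-freedom (clustering quality) at the cost of clustering fewer objects.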
Pay-as-you-go reconciliation in schema matching networks
2014 IEEE 30th International Conference on Data Engineering Pub Date : 2014-05-19 DOI: 10.1109/ICDE.2014.6816653
Nguyen Quoc Viet Hung, T. Nguyen, Z. Miklós, K. Aberer, A. Gal, M. Weidlich
Abstract: Schema matching is the process of establishing correspondences between the attributes of database schemas for data integration purposes. Although several automatic schema matching tools have been developed, their results are often incomplete or erroneous. To obtain a correct set of correspondences, a human expert is usually required to validate the generated correspondences. We analyze this reconciliation process in a setting where a number of schemas need to be matched, in the presence of consistency expectations about the network of attribute correspondences. We develop a probabilistic model that helps to identify the most uncertain correspondences, thus allowing us to guide the expert's work and collect input about the most problematic cases. As the availability of such experts is often limited, we develop techniques that can construct a set of good-quality correspondences with high probability, even if the expert does not validate all the necessary correspondences. We demonstrate the efficiency of our techniques through extensive experimentation using real-world datasets.
Citations: 48
ADaPT: Automatic Data Personalization based on contextual preferences
2014 IEEE 30th International Conference on Data Engineering Pub Date : 2014-05-19 DOI: 10.1109/ICDE.2014.6816749
A. Miele, E. Quintarelli, Emanuele Rabosio, L. Tanca
Abstract: This demo presents a framework for personalizing data access on the basis of the users' context and the preferences they show while in that context. The system is composed of (i) a server application, which "tailors" a view over the available data on the basis of the user's contextual preferences, previously inferred from log data, and (ii) a client application running on the user's mobile device, which allows the user to query the data view and collects the activity log for later mining. At each change of context detected by the system, the corresponding tailored view is loaded onto the client device; accordingly, the most relevant data is available to the user even when the connection is unstable or absent. The demo features a movie database, where users can browse data in different contexts and appreciate the personalization of the data views according to the inferred contextual preferences.
Citations: 6