{"title":"RuleMiner: Data quality rules discovery","authors":"Xu Chu, I. Ilyas, Paolo Papotti, Yin Ye","doi":"10.1109/ICDE.2014.6816746","DOIUrl":"https://doi.org/10.1109/ICDE.2014.6816746","url":null,"abstract":"Integrity constraints (ICs) are valuable tools for enforcing correct application semantics. However, manually designing ICs requires experts and time, hence the need for automatic discovery. Previous automatic IC discovery approaches suffer from (1) limited IC language expressiveness; and (2) time-consuming manual verification of discovered ICs. We introduce RULEMINER, a system for discovering data quality rules that addresses the limitations of existing solutions.","PeriodicalId":159130,"journal":{"name":"2014 IEEE 30th International Conference on Data Engineering","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116909865","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fast incremental SimRank on link-evolving graphs","authors":"Weiren Yu, Xuemin Lin, W. Zhang","doi":"10.1109/ICDE.2014.6816660","DOIUrl":"https://doi.org/10.1109/ICDE.2014.6816660","url":null,"abstract":"SimRank is an arresting measure of node-pair similarity based on hyperlinks. It iteratively follows the concept that 2 nodes are similar if they are referenced by similar nodes. Real graphs are often large, and links constantly evolve with small changes over time. This paper considers fast incremental computations of SimRank on link-evolving graphs. The prior approach [12] to this issue factorizes the graph via a singular value decomposition (SVD) first, and then incrementally maintains this factorization for link updates at the expense of exactness. Consequently, all node-pair similarities are estimated in O(r⁴n²) time on a graph of n nodes, where r is the target rank of the low-rank approximation, which is not negligibly small in practice. In this paper, we propose a novel fast incremental paradigm. (1) We characterize the SimRank update matrix ΔS, in response to every link update, via a rank-one Sylvester matrix equation. By virtue of this, we devise a fast incremental algorithm computing similarities of n² node-pairs in O(Kn²) time for K iterations. (2) We also propose an effective pruning technique capturing the “affected areas” of ΔS to skip unnecessary computations, without loss of exactness. This can further accelerate the incremental SimRank computation to O(K(nd+|AFF|)) time, where d is the average in-degree of the old graph, and |AFF| (≤ n²) is the size of “affected areas” in ΔS, and in practice, |AFF| ≪ n². Our empirical evaluations verify that our algorithm (a) outperforms the best known link-update algorithm [12], and (b) runs much faster than its batch counterpart when link updates are small.","PeriodicalId":159130,"journal":{"name":"2014 IEEE 30th International Conference on Data Engineering","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129257459","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Crowd-powered find algorithms","authors":"A. Sarma, Aditya G. Parameswaran, H. Garcia-Molina, A. Halevy","doi":"10.1109/ICDE.2014.6816715","DOIUrl":"https://doi.org/10.1109/ICDE.2014.6816715","url":null,"abstract":"We consider the problem of using humans to find a bounded number of items satisfying certain properties, from a data set. For instance, we may want humans to identify a select number of travel photos from a data set of photos to display on a travel website, or a candidate set of resumes that meet certain requirements from a large pool of applicants. Since data sets can be enormous, and since monetary cost and latency of data processing with humans can be large, optimizing the use of humans for finding items is an important challenge. We formally define the problem using the metrics of cost and time, and design optimal algorithms that span the skyline of cost and time, i.e., we provide designers the ability to control the cost vs. time trade-off. We study the deterministic as well as error-prone human answer settings, along with multiplicative and additive approximations. Lastly, we study how we may design algorithms with specific expected cost and time measures.","PeriodicalId":159130,"journal":{"name":"2014 IEEE 30th International Conference on Data Engineering","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129818086","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scalable top-k spatio-temporal term querying","authors":"Anders Skovsgaard, Darius Sidlauskas, Christian S. Jensen","doi":"10.1109/ICDE.2014.6816647","DOIUrl":"https://doi.org/10.1109/ICDE.2014.6816647","url":null,"abstract":"With the rapidly increasing deployment of Internet-connected, location-aware mobile devices, very large and increasing amounts of geo-tagged and timestamped user-generated content, such as microblog posts, are being generated. We present indexing, update, and query processing techniques that are capable of providing the top-k terms seen in posts in a user-specified spatio-temporal range. The techniques enable interactive response times in the millisecond range in a realistic setting where the arrival rate of posts exceeds today's average tweet arrival rate by a factor of 4-10. The techniques adaptively maintain the most frequent items at various spatial and temporal granularities. They extend existing frequent item counting techniques to maintain exact counts rather than approximations. An extensive empirical study with a large collection of geo-tagged tweets shows that the proposed techniques enable online aggregation and query processing at scale in realistic settings.","PeriodicalId":159130,"journal":{"name":"2014 IEEE 30th International Conference on Data Engineering","volume":"108 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114833529","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Managing uncertainty in spatial and spatio-temporal data","authors":"Reynold Cheng, Tobias Emrich, H. Kriegel, N. Mamoulis, M. Renz, Goce Trajcevski, Andreas Züfle","doi":"10.1109/ICDE.2014.6816766","DOIUrl":"https://doi.org/10.1109/ICDE.2014.6816766","url":null,"abstract":"Location-related data has a tremendous impact in many applications of high societal relevance, and its growing volume from heterogeneous sources is a true example of Big Data [1]. An inherent property of any spatio-temporal dataset is uncertainty due to various sources of imprecision. This tutorial provides a comprehensive overview of the different challenges involved in managing uncertain spatial and spatio-temporal data and presents state-of-the-art techniques for addressing them.","PeriodicalId":159130,"journal":{"name":"2014 IEEE 30th International Conference on Data Engineering","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114249727","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cloud service placement via subgraph matching","authors":"Bo Zong, R. Raghavendra, M. Srivatsa, Xifeng Yan, Ambuj K. Singh, Kang-Won Lee","doi":"10.1109/ICDE.2014.6816704","DOIUrl":"https://doi.org/10.1109/ICDE.2014.6816704","url":null,"abstract":"Fast service placement, finding a set of nodes with enough free capacity of computation, storage, and network connectivity, is a routine task in daily cloud administration. In this work, we formulate this as a subgraph matching problem. Different from the traditional setting, including approximate and probabilistic graphs, subgraph matching on data-center networks has two unique properties. (1) Node/edge labels representing vacant CPU cycles and network bandwidth change rapidly, while the network topology varies little. (2) There is a partial order on node/edge labels. Basically, one needs to place services on nodes with enough free capacity. Existing graph indexing techniques have not considered very frequent label updates, and none of them supports partial order on numeric labels. Therefore, we resort to a new graph index framework, Gradin, to address both challenges. Gradin encodes subgraphs into multi-dimensional vectors and organizes them with indices such that it can efficiently search the matches of a query's subgraphs and combine them to form a full match. In particular, we analyze how the index parameters affect update and search performance with theoretical results. Moreover, a revised pruning algorithm is introduced to reduce unnecessary search during the combination of partial matches. Using both real and synthetic datasets, we demonstrate that Gradin outperforms the baseline approaches by up to 10 times.","PeriodicalId":159130,"journal":{"name":"2014 IEEE 30th International Conference on Data Engineering","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124019966","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Effective location identification from microblogs","authors":"Guoliang Li, Jun Hu, Jianhua Feng, K. Tan","doi":"10.1109/ICDE.2014.6816708","DOIUrl":"https://doi.org/10.1109/ICDE.2014.6816708","url":null,"abstract":"The rapid development of social networks has resulted in a proliferation of user-generated content (UGC). The UGC data, when properly analyzed, can be beneficial to many applications. For example, identifying a user's locations from microblogs is very important for effective location-based advertisement and recommendation. In this paper, we study the problem of identifying a user's locations from microblogs. This problem is rather challenging because the location information in a microblog is incomplete and we cannot get an accurate location from a local microblog. To address this challenge, we propose a global location identification method, called Glitter. Glitter combines multiple microblogs of a user and utilizes them to identify the user's locations. Glitter not only improves the quality of identifying a user's location but also supplements the location of a microblog so as to obtain an accurate location of a microblog. To facilitate location identification, Glitter organizes points of interest (POIs) into a tree structure where leaf nodes are POIs and non-leaf nodes are segments of POIs, e.g., countries, states, cities, districts, and streets. Using the tree structure, Glitter first extracts candidate locations from each microblog of a user which correspond to some tree nodes. Then Glitter aggregates these candidate locations and identifies top-k locations of the user. Using the identified top-k user locations, Glitter refines the candidate locations and computes top-k locations of each microblog. To achieve high recall, we enable fuzzy matching between locations and microblogs. We propose an incremental algorithm to support dynamic updates of microblogs. Experimental results on real-world datasets show that our method achieves high quality and good performance, and scales very well.","PeriodicalId":159130,"journal":{"name":"2014 IEEE 30th International Conference on Data Engineering","volume":"134 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123206531","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Continuous pattern detection over billion-edge graph using distributed framework","authors":"Jun Gao, Chang Zhou, Jiashuai Zhou, J. Yu","doi":"10.1109/ICDE.2014.6816681","DOIUrl":"https://doi.org/10.1109/ICDE.2014.6816681","url":null,"abstract":"Continuous pattern detection plays an important role in monitoring-related applications. The large size and dynamic update of graphs, along with the massive search space, pose huge challenges in developing an efficient continuous pattern detection system. In this paper, we leverage a distributed graph processing framework to approximately detect a given pattern over a large dynamic graph. We aim to improve the scalability and precision, and reduce the response time and message cost in the detection. We convert a given query pattern into a Single-Sink DAG (Directed Acyclic Graph), and propose an evaluation plan with message transitions on the DAG, abbreviated as SSD plan, to detect the pattern in a large dynamic graph. The SSD plan can guide the data graph exploration via messages, and the messages will converge at data sink vertices, which then detect the existence of the query pattern. We also conduct join operations over partial vertices during the graph exploration to improve the precision of pattern detection. In addition, we show that the SSD plan can support continuous queries over dynamic graphs with slight extensions. We further design various sink vertex selection strategies and neighborhood-based transition rule attachment to lower the evaluation cost. The experiments on billion-edge real-life graphs using Giraph, an open source implementation of Pregel, illustrate the efficiency and effectiveness of our method.","PeriodicalId":159130,"journal":{"name":"2014 IEEE 30th International Conference on Data Engineering","volume":"333 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132134699","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Private search on key-value stores with hierarchical indexes","authors":"Haibo Hu, Jianliang Xu, Xizhong Xu, Kexin Pei, Byron Choi, Shuigeng Zhou","doi":"10.1109/ICDE.2014.6816687","DOIUrl":"https://doi.org/10.1109/ICDE.2014.6816687","url":null,"abstract":"Query processing that preserves both the query privacy at the client and the data privacy at the server is a new research problem. It has many practical applications, especially when the queries are about the sensitive attributes of records. However, most existing studies, including those originating from data outsourcing, address the data privacy and query privacy separately. Although secure multiparty computation (SMC) is a suitable computing paradigm for this problem, it has significant computation and communication overheads and is thus unable to scale up to large datasets. Fortunately, recent advances in cryptography bring us two relevant tools: conditional oblivious transfer and homomorphic encryption. In this paper, we integrate database indexing techniques with these tools in the context of private search on key-value stores. We first present an oblivious index traversal framework, in which the server cannot trace the index traversal path of a query during evaluation. The framework is generic and can support a wide range of query types with a suitable homomorphic encryption algorithm in place. Based on this framework, we devise secure protocols for classic key search queries on B+-tree and R-tree indexes. Our approach is verified by both security analysis and performance study.","PeriodicalId":159130,"journal":{"name":"2014 IEEE 30th International Conference on Data Engineering","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128796101","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MELODY-JOIN: Efficient Earth Mover's Distance similarity joins using MapReduce","authors":"Jin Huang, Rui Zhang, R. Buyya, Jian Chen","doi":"10.1109/ICDE.2014.6816702","DOIUrl":"https://doi.org/10.1109/ICDE.2014.6816702","url":null,"abstract":"The Earth Mover's Distance (EMD) similarity join retrieves pairs of records with EMD below a given threshold. It has a number of important applications such as near-duplicate image retrieval and pattern analysis in probabilistic datasets. However, the computational cost of EMD is super-cubic in the number of bins in the histograms used to represent the data objects. Consequently, the EMD similarity join operation is prohibitive for large datasets. This is the first paper that specifically addresses the EMD similarity join, and we propose to use MapReduce to approach this problem. The MapReduce algorithms designed for generic metric distance similarity joins are inefficient for the EMD similarity join because they involve a large number of distance computations and have unbalanced workloads on reducers when dealing with skewed datasets. We propose a novel framework, named MELODY-JOIN, which transforms data into the space of EMD lower bounds and performs pruning and partitioning at a low cost because computing these EMD lower bounds has a constant complexity. Furthermore, we address two key problems, the limited pruning power and the unbalanced workloads, by enhancing each phase in the MELODY-JOIN framework. We conduct extensive experiments on real datasets. The results show that MELODY-JOIN outperforms the state-of-the-art technique by an order of magnitude, scales up better on large datasets than the state-of-the-art technique, and scales out well on distributed machines.","PeriodicalId":159130,"journal":{"name":"2014 IEEE 30th International Conference on Data Engineering","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127048964","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}