Discriminative features for identifying and interpreting outliers
Xuan-Hong Dang, I. Assent, R. Ng, A. Zimek, Erich Schubert
2014 IEEE 30th International Conference on Data Engineering (ICDE). DOI: 10.1109/ICDE.2014.6816642

Abstract: We consider the problem of outlier detection and interpretation. While most existing studies focus on the first problem, we simultaneously address the equally important challenge of outlier interpretation. We propose an algorithm that uncovers outliers in subspaces of reduced dimensionality in which they are well discriminated from regular objects, while at the same time retaining the natural local structure of the original data to ensure the quality of outlier explanation. Our algorithm takes a mathematically appealing approach from spectral graph embedding theory, and we show that it achieves the globally optimal solution for the objective of subspace learning. Using a number of real-world datasets, we demonstrate its appealing performance not only w.r.t. the outlier detection rate but also w.r.t. the discriminative, human-interpretable features. This is the first approach to exploit discriminative features for both outlier detection and interpretation, leading to a better understanding of how and why hidden outliers are exceptional.

Omid: Lock-free transactional support for distributed data stores
Daniel Gómez Ferro, F. Junqueira, I. Kelly, B. Reed, M. Yabandeh
2014 IEEE 30th International Conference on Data Engineering (ICDE). DOI: 10.1109/ICDE.2014.6816691

Abstract: In this paper, we introduce Omid, a tool for lock-free transactional support in large data stores such as HBase. Omid uses a centralized scheme and implements snapshot isolation, a property that guarantees that all read operations of a transaction are performed on a consistent snapshot of the data. In a lock-based approach, unreleased distributed locks held by a failed or slow client block others. By using a centralized scheme, Omid is able to implement a lock-free commit algorithm that does not suffer from this problem. Moreover, Omid replicates a lightweight, read-only copy of the transaction metadata to the clients, where a large share of metadata queries can be serviced locally. Thanks to this technique, Omid requires no modification to either the source code of the data store or the tables' schema, and the overhead on data servers is negligible. The experimental results show that our implementation on a simple dual-core machine can service up to a thousand client machines. While the added latency is limited to only 10 ms, Omid scales up to 124K write transactions per second. Since this capacity is several times larger than the maximum reported traffic in similar systems, we do not expect the centralized scheme of Omid to be a bottleneck even for today's large data stores.

{"title":"Pigeon: A spatial MapReduce language","authors":"A. Eldawy, M. Mokbel","doi":"10.1109/ICDE.2014.6816751","DOIUrl":"https://doi.org/10.1109/ICDE.2014.6816751","url":null,"abstract":"With the huge amounts of spatial data collected everyday, MapReduce frameworks, such as Hadoop, have become a common choice to analyze big spatial data for scientists and people from industry. Users prefer to use high level languages, such as Pig Latin, to deal with Hadoop for simplicity. Unfortunately, these languages are designed for primitive non-spatial data and have no support for spatial data types or functions. This demonstration presents Pigeon, a spatial extension to Pig which provides spatial functionality in Pig. Pigeon is implemented through user defined functions (UDFs) making it easy to use and compatible with all recent versions of Pig. This also allows it to integrate smoothly with existing non-spatial functions and operations such as Filter, Join and Group By. Pigeon is compatible with the Open Geospatial Consortium (OGC) standard which makes it easy to learn and use for users who are familiar with existing OGC-compliant tools such as PostGIS. This demonstrations shows to audience how to work with Pigeon through some interesting applications running on large scale real datasets extracted from OpenStreetMap.","PeriodicalId":159130,"journal":{"name":"2014 IEEE 30th International Conference on Data Engineering","volume":"96 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130542816","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Towards effective and efficient mining of arbitrary shaped clusters
H. Huang, Yunjun Gao, K. Chiew, Lei Chen, Qinming He
2014 IEEE 30th International Conference on Data Engineering (ICDE). DOI: 10.1109/ICDE.2014.6816637

Abstract: Mining arbitrary shaped clusters in large data sets is an open challenge in data mining. Various approaches to this problem have been proposed, but they come with high time complexity. To save computational cost, some algorithms shrink a data set to a smaller number of representative data examples; however, their user-defined shrinking ratios may significantly affect clustering performance. In this paper, we present CLASP, an effective and efficient algorithm for mining arbitrary shaped clusters. It automatically shrinks the size of a data set while effectively preserving the shape information of clusters with representative data examples. It then adjusts the positions of these representatives to strengthen their intrinsic relationship and make the cluster structures clearer and more distinct for clustering. Finally, it performs agglomerative clustering to identify the cluster structures with the help of a mutual k-nearest-neighbor-based similarity metric called Pk. Extensive experiments on both synthetic and real data sets verify the effectiveness and efficiency of our approach.

Breaking out of the MisMatch trap
Yong Zeng, Z. Bao, T. Ling, H. Jagadish, Guoliang Li
2014 IEEE 30th International Conference on Data Engineering (ICDE). DOI: 10.1109/ICDE.2014.6816713

Abstract: When users issue a query to a database, they have expectations about the results. If what they search for is unavailable in the database, the system will return an empty result or, worse, erroneous mismatch results. We call this the MisMatch problem. In this paper, we solve the MisMatch problem in the context of XML keyword search. Our solution is based on two novel concepts that we introduce: Target Node Type and Distinguishability. Using these concepts, we develop a low-cost post-processing algorithm on the results of query evaluation to detect the MisMatch problem and generate helpful suggestions to users. Our approach has three noteworthy features: (1) for queries with the MisMatch problem, it generates an explanation, suggested queries, and their sample results as output, helping users judge whether the MisMatch problem is solved without reading all query results; (2) it is portable, as it works with any LCA-based matching semantics and is orthogonal to the choice of result retrieval method; (3) it is lightweight, occupying a very small proportion of the whole query evaluation time. Extensive experiments on three real datasets verify the effectiveness, efficiency, and scalability of our approach. A search engine called XClear has been built and is available at http://xclear.comp.nus.edu.sg.

ATraPos: Adaptive transaction processing on hardware islands
Danica Porobic, Erietta Liarou, Pınar Tözün, A. Ailamaki
2014 IEEE 30th International Conference on Data Engineering (ICDE). DOI: 10.1109/ICDE.2014.6816692

Abstract: Nowadays, high-performance transaction processing applications increasingly run on multisocket multicore servers. Such architectures exhibit non-uniform memory access latency as well as non-uniform thread communication costs. Unfortunately, traditional shared-everything database management systems are designed for uniform inter-core communication speeds, which causes unpredictable access latencies in the critical path. While lack of data locality may be a minor nuisance on systems with fewer than four processors, it becomes a serious scalability limitation on larger systems due to accesses to centralized data structures. In this paper, we propose ATraPos, a storage manager design that is aware of the non-uniform access latencies of multisocket systems. ATraPos achieves good data locality by carefully partitioning the data as well as internal data structures (e.g., state information) across the available processors and by assigning threads to specific partitions. Furthermore, ATraPos dynamically adapts to the workload: when the workload changes, ATraPos detects the change and automatically revises the data partitioning and thread placement to fit the current access patterns and hardware topology. We prototype ATraPos on top of the open-source storage manager Shore-MT and present a detailed experimental analysis with both synthetic and standard (TPC-C and TATP) benchmarks. We show that ATraPos achieves performance improvements ranging from 1.4x to 6.7x for a wide collection of transactional workloads. In addition, we show that the adaptive monitoring and partitioning scheme of ATraPos imposes negligible cost, while allowing the system to adapt dynamically and gracefully when the workload changes.

{"title":"Random-walk domination in large graphs","authors":"Ronghua Li, J. Yu, Xin Huang, Hong Cheng","doi":"10.1109/ICDE.2014.6816696","DOIUrl":"https://doi.org/10.1109/ICDE.2014.6816696","url":null,"abstract":"We introduce and formulate two types of random-walk domination problems in graphs motivated by a number of applications in practice (e.g., item-placement problem in online social networks, Ads-placement problem in advertisement networks, and resource-placement problem in P2P networks). Specifically, given a graph G, the goal of the first type of random-walk domination problem is to target k nodes such that the total hitting time of an L-length random walk starting from the remaining nodes to the targeted nodes is minimized. The second type of random-walk domination problem is to find k nodes to maximize the expected number of nodes that hit any one targeted node through an L-length random walk. We prove that these problems are two special instances of the submodular set function maximization with cardinality constraint problem. To solve them effectively, we propose a dynamic-programming (DP) based greedy algorithm which is with near-optimal performance guarantee. The DP-based greedy algorithm, however, is not very efficient due to the expensive marginal gain evaluation. To further speed up the algorithm, we propose an approximate greedy algorithm with linear time complexity w.r.t. the graph size and also with near-optimal performance guarantee. The approximate greedy algorithm is based on carefully designed random walk sampling and sample-materialization techniques. Extensive experiments demonstrate the effectiveness, efficiency and scalability of the proposed algorithms.","PeriodicalId":159130,"journal":{"name":"2014 IEEE 30th International Conference on Data Engineering","volume":"86 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126247692","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Combining information extraction and human computing for crowdsourced knowledge acquisition
Sarath Kumar Kondreddi, P. Triantafillou, G. Weikum
2014 IEEE 30th International Conference on Data Engineering (ICDE). DOI: 10.1109/ICDE.2014.6816717

Abstract: Automatic information extraction (IE) enables the construction of very large knowledge bases (KBs), with relational facts on millions of entities from text corpora and Web sources. However, such KBs contain errors and they are far from being complete. This motivates the need for exploiting human intelligence and knowledge using crowd-based human computing (HC) for assessing the validity of facts and for gathering additional knowledge. This paper presents a novel system architecture, called Higgins, which shows how to effectively integrate an IE engine and a HC engine. Higgins generates game questions where players choose or fill in missing relations for subject-relation-object triples. For generating multiple-choice answer candidates, we have constructed a large dictionary of entity names and relational phrases, and have developed specifically designed statistical language models for phrase relatedness. To this end, we combine semantic resources like WordNet, ConceptNet, and others with statistics derived from a large Web corpus. We demonstrate the effectiveness of Higgins for knowledge acquisition by crowdsourced gathering of relationships between characters in narrative descriptions of movies and books.

Mapping and cleaning
Floris Geerts, G. Mecca, Paolo Papotti, Donatello Santoro
2014 IEEE 30th International Conference on Data Engineering (ICDE). DOI: 10.1109/ICDE.2014.6816654

Abstract: We address the challenging and open problem of bringing together two crucial activities in data integration and data quality, i.e., transforming data using schema mappings, and fixing conflicts and inconsistencies using data repairing. This problem is made complex by several factors. First, schema mappings and data repairing have traditionally been considered as separate activities, and research has progressed in a largely independent way in the two fields. Second, the elegant formalizations and the algorithms that have been proposed for both tasks have had mixed fortune in scaling to large databases. In the paper, we introduce a very general notion of a mapping and cleaning scenario that incorporates a wide variety of features, like, for example, user interventions. We develop a new semantics for these scenarios that represents a conservative extension of previous semantics for schema mappings and data repairing. Based on the semantics, we introduce a chase-based algorithm to compute solutions. Appropriate care is devoted to developing a scalable implementation of the chase algorithm. To the best of our knowledge, this is the first general and scalable proposal in this direction.

{"title":"Generating private synthetic databases for untrusted system evaluation","authors":"Wentian Lu, G. Miklau, Vani Gupta","doi":"10.1109/ICDE.2014.6816689","DOIUrl":"https://doi.org/10.1109/ICDE.2014.6816689","url":null,"abstract":"Evaluating the performance of database systems is crucial when database vendors or researchers are developing new technologies. But such evaluation tasks rely heavily on actual data and query workloads that are often unavailable to researchers due to privacy restrictions. To overcome this barrier, we propose a framework for the release of a synthetic database which accurately models selected performance properties of the original database. We improve on prior work on synthetic database generation by providing a formal, rigorous guarantee of privacy. Accuracy is achieved by generating synthetic data using a carefully selected set of statistical properties of the original data which balance privacy loss with relevance to the given query workload. An important contribution of our framework is an extension of standard differential privacy to multiple tables.","PeriodicalId":159130,"journal":{"name":"2014 IEEE 30th International Conference on Data Engineering","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131851634","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}