Discriminative features for identifying and interpreting outliers
Xuan-Hong Dang, I. Assent, R. Ng, A. Zimek, Erich Schubert
2014 IEEE 30th International Conference on Data Engineering (ICDE). DOI: 10.1109/ICDE.2014.6816642

Abstract: We consider the problem of outlier detection and interpretation. While most existing studies focus on the first problem, we simultaneously address the equally important challenge of outlier interpretation. We propose an algorithm that uncovers outliers in subspaces of reduced dimensionality in which they are well discriminated from regular objects, while at the same time retaining the natural local structure of the original data to ensure the quality of outlier explanation. Our algorithm takes a mathematically appealing approach from spectral graph embedding theory, and we show that it achieves the globally optimal solution for the objective of subspace learning. Using a number of real-world datasets, we demonstrate its appealing performance not only w.r.t. the outlier detection rate but also w.r.t. the discriminative, human-interpretable features. This is the first approach to exploit discriminative features for both outlier detection and interpretation, leading to a better understanding of how and why hidden outliers are exceptional.

Omid: Lock-free transactional support for distributed data stores
Daniel Gómez Ferro, F. Junqueira, I. Kelly, B. Reed, M. Yabandeh
2014 IEEE 30th International Conference on Data Engineering (ICDE). DOI: 10.1109/ICDE.2014.6816691

Abstract: In this paper, we introduce Omid, a tool for lock-free transactional support in large data stores such as HBase. Omid uses a centralized scheme and implements snapshot isolation, a property that guarantees that all read operations of a transaction are performed on a consistent snapshot of the data. In a lock-based approach, unreleased distributed locks held by a failed or slow client block others. By using a centralized scheme, Omid is able to implement a lock-free commit algorithm that does not suffer from this problem. Moreover, Omid replicates a lightweight, read-only copy of the transaction metadata to the clients, where a large share of metadata queries can be serviced locally. Thanks to this technique, Omid requires no modification to either the source code of the data store or the tables' schema, and the overhead on data servers is negligible. The experimental results show that our implementation on a simple dual-core machine can service up to a thousand client machines. While the added latency is limited to only 10 ms, Omid scales up to 124K write transactions per second. Since this capacity is several times larger than the maximum reported traffic in similar systems, we do not expect the centralized scheme of Omid to be a bottleneck even for today's large data stores.

{"title":"Pigeon: A spatial MapReduce language","authors":"A. Eldawy, M. Mokbel","doi":"10.1109/ICDE.2014.6816751","DOIUrl":"https://doi.org/10.1109/ICDE.2014.6816751","url":null,"abstract":"With the huge amounts of spatial data collected everyday, MapReduce frameworks, such as Hadoop, have become a common choice to analyze big spatial data for scientists and people from industry. Users prefer to use high level languages, such as Pig Latin, to deal with Hadoop for simplicity. Unfortunately, these languages are designed for primitive non-spatial data and have no support for spatial data types or functions. This demonstration presents Pigeon, a spatial extension to Pig which provides spatial functionality in Pig. Pigeon is implemented through user defined functions (UDFs) making it easy to use and compatible with all recent versions of Pig. This also allows it to integrate smoothly with existing non-spatial functions and operations such as Filter, Join and Group By. Pigeon is compatible with the Open Geospatial Consortium (OGC) standard which makes it easy to learn and use for users who are familiar with existing OGC-compliant tools such as PostGIS. This demonstrations shows to audience how to work with Pigeon through some interesting applications running on large scale real datasets extracted from OpenStreetMap.","PeriodicalId":159130,"journal":{"name":"2014 IEEE 30th International Conference on Data Engineering","volume":"96 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130542816","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Towards effective and efficient mining of arbitrary shaped clusters
H. Huang, Yunjun Gao, K. Chiew, Lei Chen, Qinming He
2014 IEEE 30th International Conference on Data Engineering (ICDE). DOI: 10.1109/ICDE.2014.6816637

Abstract: Mining arbitrary shaped clusters in large data sets is an open challenge in data mining. Various approaches to this problem have been proposed, but they come with high time complexity. To save computational cost, some algorithms shrink a data set to a smaller number of representative data examples; however, their user-defined shrinking ratios may significantly affect clustering performance. In this paper, we present CLASP, an effective and efficient algorithm for mining arbitrary shaped clusters. It automatically shrinks the size of a data set while effectively preserving the shape information of clusters with representative data examples. It then adjusts the positions of these representatives to strengthen their intrinsic relationship and make the cluster structures clearer and more distinct for clustering. Finally, it performs agglomerative clustering to identify the cluster structures with the help of a mutual k-nearest-neighbor-based similarity metric called Pk. Extensive experiments on both synthetic and real data sets verify the effectiveness and efficiency of our approach.

Breaking out of the MisMatch trap
Yong Zeng, Z. Bao, T. Ling, H. Jagadish, Guoliang Li
2014 IEEE 30th International Conference on Data Engineering (ICDE). DOI: 10.1109/ICDE.2014.6816713

Abstract: When users issue a query to a database, they have expectations about the results. If what they search for is unavailable in the database, the system will return an empty result or, worse, erroneous mismatch results. We call this the MisMatch problem. In this paper, we solve the MisMatch problem in the context of XML keyword search. Our solution is based on two novel concepts that we introduce: Target Node Type and Distinguishability. Using these concepts, we develop a low-cost post-processing algorithm on the results of query evaluation to detect the MisMatch problem and generate helpful suggestions to users. Our approach has three noteworthy features: (1) for queries with the MisMatch problem, it generates an explanation, suggested queries, and their sample results as output, helping users judge whether the MisMatch problem is solved without reading all query results; (2) it is portable, as it works with any LCA-based matching semantics and is orthogonal to the choice of result retrieval method; (3) it is lightweight, occupying a very small proportion of the whole query evaluation time. Extensive experiments on three real datasets verify the effectiveness, efficiency, and scalability of our approach. A search engine called XClear has been built and is available at http://xclear.comp.nus.edu.sg.

ATraPos: Adaptive transaction processing on hardware islands
Danica Porobic, Erietta Liarou, Pınar Tözün, A. Ailamaki
2014 IEEE 30th International Conference on Data Engineering (ICDE). DOI: 10.1109/ICDE.2014.6816692

Abstract: Nowadays, high-performance transaction processing applications increasingly run on multisocket multicore servers. Such architectures exhibit non-uniform memory access latency as well as non-uniform thread communication costs. Unfortunately, traditional shared-everything database management systems are designed for uniform inter-core communication speeds, which causes unpredictable access latencies in the critical path. While lack of data locality may be a minor nuisance on systems with fewer than four processors, it becomes a serious scalability limitation on larger systems due to accesses to centralized data structures. In this paper, we propose ATraPos, a storage manager design that is aware of the non-uniform access latencies of multisocket systems. ATraPos achieves good data locality by carefully partitioning the data as well as internal data structures (e.g., state information) across the available processors and by assigning threads to specific partitions. Furthermore, ATraPos dynamically adapts to the workload: when the workload changes, ATraPos detects the change and automatically revises the data partitioning and thread placement to fit the current access patterns and hardware topology. We prototype ATraPos on top of the open-source storage manager Shore-MT and present a detailed experimental analysis with both synthetic and standard (TPC-C and TATP) benchmarks. We show that ATraPos achieves performance improvements ranging from 1.4x to 6.7x for a wide collection of transactional workloads. In addition, we show that the adaptive monitoring and partitioning scheme of ATraPos imposes negligible cost, while allowing the system to adapt dynamically and gracefully when the workload changes.

{"title":"Random-walk domination in large graphs","authors":"Ronghua Li, J. Yu, Xin Huang, Hong Cheng","doi":"10.1109/ICDE.2014.6816696","DOIUrl":"https://doi.org/10.1109/ICDE.2014.6816696","url":null,"abstract":"We introduce and formulate two types of random-walk domination problems in graphs motivated by a number of applications in practice (e.g., item-placement problem in online social networks, Ads-placement problem in advertisement networks, and resource-placement problem in P2P networks). Specifically, given a graph G, the goal of the first type of random-walk domination problem is to target k nodes such that the total hitting time of an L-length random walk starting from the remaining nodes to the targeted nodes is minimized. The second type of random-walk domination problem is to find k nodes to maximize the expected number of nodes that hit any one targeted node through an L-length random walk. We prove that these problems are two special instances of the submodular set function maximization with cardinality constraint problem. To solve them effectively, we propose a dynamic-programming (DP) based greedy algorithm which is with near-optimal performance guarantee. The DP-based greedy algorithm, however, is not very efficient due to the expensive marginal gain evaluation. To further speed up the algorithm, we propose an approximate greedy algorithm with linear time complexity w.r.t. the graph size and also with near-optimal performance guarantee. The approximate greedy algorithm is based on carefully designed random walk sampling and sample-materialization techniques. Extensive experiments demonstrate the effectiveness, efficiency and scalability of the proposed algorithms.","PeriodicalId":159130,"journal":{"name":"2014 IEEE 30th International Conference on Data Engineering","volume":"86 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126247692","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Combining information extraction and human computing for crowdsourced knowledge acquisition
Sarath Kumar Kondreddi, P. Triantafillou, G. Weikum
2014 IEEE 30th International Conference on Data Engineering (ICDE). DOI: 10.1109/ICDE.2014.6816717

Abstract: Automatic information extraction (IE) enables the construction of very large knowledge bases (KBs), with relational facts on millions of entities from text corpora and Web sources. However, such KBs contain errors and they are far from being complete. This motivates the need for exploiting human intelligence and knowledge using crowd-based human computing (HC) for assessing the validity of facts and for gathering additional knowledge. This paper presents a novel system architecture, called Higgins, which shows how to effectively integrate an IE engine and a HC engine. Higgins generates game questions where players choose or fill in missing relations for subject-relation-object triples. For generating multiple-choice answer candidates, we have constructed a large dictionary of entity names and relational phrases, and have developed specifically designed statistical language models for phrase relatedness. To this end, we combine semantic resources like WordNet, ConceptNet, and others with statistics derived from a large Web corpus. We demonstrate the effectiveness of Higgins for knowledge acquisition by crowdsourced gathering of relationships between characters in narrative descriptions of movies and books.

Mapping and cleaning
Floris Geerts, G. Mecca, Paolo Papotti, Donatello Santoro
2014 IEEE 30th International Conference on Data Engineering (ICDE). DOI: 10.1109/ICDE.2014.6816654

Abstract: We address the challenging and open problem of bringing together two crucial activities in data integration and data quality, i.e., transforming data using schema mappings, and fixing conflicts and inconsistencies using data repairing. This problem is made complex by several factors. First, schema mappings and data repairing have traditionally been considered as separate activities, and research has progressed in a largely independent way in the two fields. Second, the elegant formalizations and the algorithms that have been proposed for both tasks have had mixed fortune in scaling to large databases. In the paper, we introduce a very general notion of a mapping and cleaning scenario that incorporates a wide variety of features, like, for example, user interventions. We develop a new semantics for these scenarios that represents a conservative extension of previous semantics for schema mappings and data repairing. Based on the semantics, we introduce a chase-based algorithm to compute solutions. Appropriate care is devoted to developing a scalable implementation of the chase algorithm. To the best of our knowledge, this is the first general and scalable proposal in this direction.

{"title":"Generating private synthetic databases for untrusted system evaluation","authors":"Wentian Lu, G. Miklau, Vani Gupta","doi":"10.1109/ICDE.2014.6816689","DOIUrl":"https://doi.org/10.1109/ICDE.2014.6816689","url":null,"abstract":"Evaluating the performance of database systems is crucial when database vendors or researchers are developing new technologies. But such evaluation tasks rely heavily on actual data and query workloads that are often unavailable to researchers due to privacy restrictions. To overcome this barrier, we propose a framework for the release of a synthetic database which accurately models selected performance properties of the original database. We improve on prior work on synthetic database generation by providing a formal, rigorous guarantee of privacy. Accuracy is achieved by generating synthetic data using a carefully selected set of statistical properties of the original data which balance privacy loss with relevance to the given query workload. An important contribution of our framework is an extension of standard differential privacy to multiple tables.","PeriodicalId":159130,"journal":{"name":"2014 IEEE 30th International Conference on Data Engineering","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131851634","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}