{"title":"Memory-constrained aggregate computation over data streams","authors":"K. Naidu, R. Rastogi, Scott Satkin, A. Srinivasan","doi":"10.1109/ICDE.2011.5767860","DOIUrl":"https://doi.org/10.1109/ICDE.2011.5767860","url":null,"abstract":"In this paper, we study the problem of efficiently computing multiple aggregation queries over a data stream. In order to share computation, prior proposals have suggested instantiating certain intermediate aggregates which are then used to generate the final answers for input queries. In this work, we make a number of important contributions aimed at improving the execution and generation of query plans containing intermediate aggregates. These include: (1) a different hashing model, which has low eviction rates, and also allows us to accurately estimate the number of evictions, (2) a comprehensive query execution cost model based on these estimates, (3) an efficient greedy heuristic for constructing good low-cost query plans, (4) provably near-optimal and optimal algorithms for allocating the available memory to aggregates in the query plan when the input data distribution is Zipf-like and Uniform, respectively, and (5) a detailed performance study with real-life IP flow data sets, which show that our multiple aggregates computation techniques consistently outperform the best-known approach.","PeriodicalId":332374,"journal":{"name":"2011 IEEE 27th International Conference on Data Engineering","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131573115","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A unified approach for computing top-k pairs in multidimensional space","authors":"M. A. Cheema, Xuemin Lin, Haixun Wang, Jianmin Wang, W. Zhang","doi":"10.1109/ICDE.2011.5767903","DOIUrl":"https://doi.org/10.1109/ICDE.2011.5767903","url":null,"abstract":"Top-k pairs queries have many real applications. k closest pairs queries, k furthest pairs queries and their bichromatic variants are some of the examples of the top-k pairs queries that rank the pairs on distance functions. While these queries have received significant research attention, there does not exist a unified approach that can efficiently answer all these queries. Moreover, there is no existing work that supports top-k pairs queries based on generic scoring functions. In this paper, we present a unified approach that supports a broad class of top-k pairs queries including the queries mentioned above. Our proposed approach allows the users to define a local scoring function for each attribute involved in the query and a global scoring function that computes the final score of each pair by combining its scores on different attributes. We propose efficient internal and external memory algorithms and our theoretical analysis shows that the expected performance of the algorithms is optimal when two or less attributes are involved. Our approach does not require any pre-built indexes, is easy to implement and has low memory requirement. We conduct extensive experiments to demonstrate the efficiency of our proposed approach.","PeriodicalId":332374,"journal":{"name":"2011 IEEE 27th International Conference on Data Engineering","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134031739","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On dimensionality reduction of massive graphs for indexing and retrieval","authors":"C. Aggarwal, Haixun Wang","doi":"10.1109/ICDE.2011.5767834","DOIUrl":"https://doi.org/10.1109/ICDE.2011.5767834","url":null,"abstract":"In this paper, we will examine the problem of dimensionality reduction of massive disk-resident data sets. Graph mining has become important in recent years because of its numerous applications in community detection, social networking, and web mining. Many graph data sets are defined on massive node domains in which the number of nodes in the underlying domain is very large. As a result, it is often difficult to store and hold the information necessary in order to retrieve and index the data. Most known methods for dimensionality reduction are effective only for data sets defined on modest domains. Furthermore, while the problem of dimensionality reduction is most relevant to the problem of massive data sets, these algorithms are inherently not designed for the case of disk-resident data in terms of the order in which the data is accessed on disk. This is a serious limitation which restricts the applicability of current dimensionality reduction methods. Furthermore, since dimensionality reduction methods are typically designed for database applications such as indexing, it is important to design the underlying data reduction method, so that it can be effectively used for such applications. In this paper, we will examine the difficult problem of dimensionality reduction of graph data in the difficult case in which the underlying number of nodes are very large and the data set is disk-resident. We will propose an effective sampling algorithm for dimensionality reduction and show how to perform the dimensionality reduction in a limited number of passes on disk. We will also design the technique to be highly interpretable and friendly for indexing applications. We will illustrate the effectiveness and efficiency of the approach on a number of real data sets.","PeriodicalId":332374,"journal":{"name":"2011 IEEE 27th International Conference on Data Engineering","volume":"59 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129278334","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Top-k keyword search over probabilistic XML data","authors":"Jianxin Li, Chengfei Liu, Rui Zhou, Wei Wang","doi":"10.1109/ICDE.2011.5767875","DOIUrl":"https://doi.org/10.1109/ICDE.2011.5767875","url":null,"abstract":"Despite the proliferation of work on XML keyword query, it remains open to support keyword query over probabilistic XML data. Compared with traditional keyword search, it is far more expensive to answer a keyword query over probabilistic XML data due to the consideration of possible world semantics. In this paper, we firstly define the new problem of studying top-k keyword search over probabilistic XML data, which is to retrieve k SLCA results with the k highest probabilities of existence. And then we propose two efficient algorithms. The first algorithm PrStack can find k SLCA results with the k highest probabilities by scanning the relevant keyword nodes only once. To further improve the efficiency, we propose a second algorithm EagerTopK based on a set of pruning properties which can quickly prune unsatisfied SLCA candidates. Finally, we implement the two algorithms and compare their performance with analysis of extensive experimental results.","PeriodicalId":332374,"journal":{"name":"2011 IEEE 27th International Conference on Data Engineering","volume":"108 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122502955","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CT-index: Fingerprint-based graph indexing combining cycles and trees","authors":"K. Klein, Nils M. Kriege, Petra Mutzel","doi":"10.1109/ICDE.2011.5767909","DOIUrl":"https://doi.org/10.1109/ICDE.2011.5767909","url":null,"abstract":"Efficient subgraph queries in large databases are a time-critical task in many application areas as e.g. biology or chemistry, where biological networks or chemical compounds are modeled as graphs. The NP-completeness of the underlying subgraph isomorphism problem renders an exact subgraph test for each database graph infeasible. Therefore efficient methods have to be found that avoid most of these tests but still allow to identify all graphs containing the query pattern. We propose a new approach based on the filter-verification paradigm, using a new hash-key fingerprint technique with a combination of tree and cycle features for filtering and a new subgraph isomorphism test for verification. Our approach is able to cope with edge and vertex labels and also allows to use wild card patterns for the search. We present an experimental comparison of our approach with state-of-the-art methods using a benchmark set of both real world and generated graph instances that shows its practicability. Our approach is implemented as part of the Scaffold Hunter software, a tool for the visual analysis of chemical compound databases.","PeriodicalId":332374,"journal":{"name":"2011 IEEE 27th International Conference on Data Engineering","volume":"105 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127111389","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HyPer: A hybrid OLTP&OLAP main memory database system based on virtual memory snapshots","authors":"A. Kemper, Thomas Neumann","doi":"10.1109/ICDE.2011.5767867","DOIUrl":"https://doi.org/10.1109/ICDE.2011.5767867","url":null,"abstract":"The two areas of online transaction processing (OLTP) and online analytical processing (OLAP) present different challenges for database architectures. Currently, customers with high rates of mission-critical transactions have split their data into two separate systems, one database for OLTP and one so-called data warehouse for OLAP. While allowing for decent transaction rates, this separation has many disadvantages including data freshness issues due to the delay caused by only periodically initiating the Extract Transform Load-data staging and excessive resource consumption due to maintaining two separate information systems. We present an efficient hybrid system, called HyPer, that can handle both OLTP and OLAP simultaneously by using hardware-assisted replication mechanisms to maintain consistent snapshots of the transactional data. HyPer is a main-memory database system that guarantees the ACID properties of OLTP transactions and executes OLAP query sessions (multiple queries) on the same, arbitrarily current and consistent snapshot. The utilization of the processor-inherent support for virtual memory management (address translation, caching, copy on update) yields both at the same time: unprecedentedly high transaction rates as high as 100000 per second and very fast OLAP query response times on a single system executing both workloads in parallel. The performance analysis is based on a combined TPC-C and TPC-H benchmark.","PeriodicalId":332374,"journal":{"name":"2011 IEEE 27th International Conference on Data Engineering","volume":"148 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121793806","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Flexible use of cloud resources through profit maximization and price discrimination","authors":"Konstantinos Tsakalozos, H. Kllapi, Evangelia A. Sitaridi, M. Roussopoulos, Dimitris Paparas, A. Delis","doi":"10.1109/ICDE.2011.5767932","DOIUrl":"https://doi.org/10.1109/ICDE.2011.5767932","url":null,"abstract":"Modern frameworks, such as Hadoop, combined with abundance of computing resources from the cloud, offer a significant opportunity to address long standing challenges in distributed processing. Infrastructure-as-a-Service clouds reduce the investment cost of renting a large data center while distributed processing frameworks are capable of efficiently harvesting the rented physical resources. Yet, the performance users get out of these resources varies greatly because the cloud hardware is shared by all users. The value for money cloud consumers achieve renders resource sharing policies a key player in both cloud performance and user satisfaction. In this paper, we employ microeconomics to direct the allotment of cloud resources for consumption in highly scalable master-worker virtual infrastructures. Our approach is developed on two premises: the cloud-consumer always has a budget and cloud physical resources are limited. Using our approach, the cloud administration is able to maximize per-user financial profit. We show that there is an equilibrium point at which our method achieves resource sharing proportional to each user's budget. Ultimately, this approach allows us to answer the question of how many resources a consumer should request from the seemingly endless pool provided by the cloud.","PeriodicalId":332374,"journal":{"name":"2011 IEEE 27th International Conference on Data Engineering","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128116633","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"T-verifier: Verifying truthfulness of fact statements","authors":"Xian Li, W. Meng, Clement T. Yu","doi":"10.1109/ICDE.2011.5767859","DOIUrl":"https://doi.org/10.1109/ICDE.2011.5767859","url":null,"abstract":"The Web has become the most popular place for people to acquire information. Unfortunately, it is widely recognized that the Web contains a significant amount of untruthful information. As a result, good tools are needed to help Web users determine the truthfulness of certain information. In this paper, we propose a two-step method that aims to determine whether a given statement is truthful, and if it is not, find out the truthful statement most related to the given statement. In the first step, we try to find a small number of alternative statements of the same topic as the given statement and make sure that one of these statements is truthful. In the second step, we identify the truthful statement from the given statement and the alternative statements. Both steps heavily rely on analysing various features extracted from the search results returned by a popular search engine for appropriate queries. Our experimental results show the best variation of the proposed method can achieve a precision of about 90%.","PeriodicalId":332374,"journal":{"name":"2011 IEEE 27th International Conference on Data Engineering","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128038529","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HashFile: An efficient index structure for multimedia data","authors":"Dongxiang Zhang, D. Agrawal, Gang Chen, A. Tung","doi":"10.1109/ICDE.2011.5767837","DOIUrl":"https://doi.org/10.1109/ICDE.2011.5767837","url":null,"abstract":"Nearest neighbor (NN) search in high dimensional space is an essential query in many multimedia retrieval applications. Due to the curse of dimensionality, existing index structures might perform even worse than a simple sequential scan of data when answering exact NN query. To improve the efficiency of NN search, locality sensitive hashing (LSH) and its variants have been proposed to find approximate NN. They adopt hash functions that can preserve the Euclidean distance so that similar objects have a high probability of colliding in the same bucket. Given a query object, candidate for the query result is obtained by accessing the points that are located in the same bucket. To improve the precision, each hash table is associated with m hash functions to recursively hash the data points into smaller buckets and remove the false positives. On the other hand, multiple hash tables are required to guarantee a high retrieval recall. Thus, tuning a good tradeoff between precision and recall becomes the main challenge for LSH. Recently, locality sensitive B-tree(LSB-tree) has been proposed to ensure both quality and efficiency. However, the index uses random I/O access. When the multimedia database is large, it requires considerable disk I/O cost to obtain an approximate ratio that works in practice. In this paper, we propose a novel index structure, named HashFile, for efficient retrieval of multimedia objects. It combines the advantages of random projection and linear scan. Unlike the LSH family in which each bucket is associated with a concatenation of m hash values, we only recursively partition the dense buckets and organize them as a tree structure. Given a query point q, the search algorithm explores the buckets near the query object in a top-down manner. The candidate buckets in each node are stored sequentially in increasing order of the hash value and can be efficiently loaded into memory for linear scan. HashFile can support both exact and approximate NN queries. Experimental results show that HashFile performs better than existing indexes both in answering both types of NN queries.","PeriodicalId":332374,"journal":{"name":"2011 IEEE 27th International Conference on Data Engineering","volume":"166 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121251195","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Secure and efficient in-network processing of exact SUM queries","authors":"Stavros Papadopoulos, A. Kiayias, D. Papadias","doi":"10.1109/ICDE.2011.5767886","DOIUrl":"https://doi.org/10.1109/ICDE.2011.5767886","url":null,"abstract":"In-network aggregation is a popular methodology adopted in wireless sensor networks, which reduces the energy expenditure in processing aggregate queries (such as SUM, MAX, etc.) over the sensor readings. Recently, research has focused on secure in-network aggregation, motivated (i) by the fact that the sensors are usually deployed in open and unsafe environments, and (ii) by new trends such as outsourcing, where the aggregation process is delegated to an untrustworthy service. This new paradigm necessitates the following key security properties: data confidentiality, integrity, authentication, and freshness. The majority of the existing work on the topic is either unsuitable for large-scale sensor networks, or provides only approximate answers for SUM queries (as well as their derivatives, e.g., COUNT, AVG, etc). Moreover, there is currently no approach offering both confidentiality and integrity at the same time. Towards this end, we propose a novel and efficient scheme called SIES. SIES is the first solution that supports Secure In-network processing of Exact SUM queries, satisfying all security properties. It achieves this goal through a combination of homomorphic encryption and secret sharing. Furthermore, SIES is lightweight (it relies on inexpensive hash operations and modular additions/multiplications), and features a very small bandwidth consumption (in the order of a few bytes). Consequently, SIES constitutes an ideal method for resource-constrained sensors.","PeriodicalId":332374,"journal":{"name":"2011 IEEE 27th International Conference on Data Engineering","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126637444","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}