{"title":"Adaptive-Size Reservoir Sampling over Data Streams","authors":"Mohammed Al-Kateb, B. Lee, X. Wang","doi":"10.1109/SSDBM.2007.29","DOIUrl":"https://doi.org/10.1109/SSDBM.2007.29","url":null,"abstract":"Reservoir sampling is a well-known technique for sequential random sampling over data streams. Conventional reservoir sampling assumes a fixed-size reservoir. There are situations, however, in which it is necessary and/or advantageous to adaptively adjust the size of a reservoir in the middle of sampling due to changes in data characteristics and/or application behavior. This paper studies adaptive-size reservoir sampling over data streams considering two main factors: reservoir size and sample uniformity. First, the paper conducts a theoretical study on the effects of adjusting the size of a reservoir while sampling is in progress. The theoretical results show that such an adjustment may have a negative impact on the probability of the sample being uniform (called uniformity confidence herein). Second, the paper presents a novel algorithm for maintaining the reservoir sample after the reservoir size is adjusted such that the resulting uniformity confidence exceeds a given threshold. Third, the paper extends the proposed algorithm to an adaptive multi-reservoir sampling algorithm for a practical application in which samples are collected from memory-limited wireless sensor networks using a mobile sink. Finally, the paper empirically examines the adaptivity of the multi-reservoir sampling algorithm with regard to reservoir size and sample uniformity using real sensor network data sets.","PeriodicalId":122925,"journal":{"name":"19th International Conference on Scientific and Statistical Database Management (SSDBM 2007)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116907025","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Distributed Algorithm for Joins in Sensor Networks","authors":"Alexandru Coman, M. Nascimento","doi":"10.1109/SSDBM.2007.26","DOIUrl":"https://doi.org/10.1109/SSDBM.2007.26","url":null,"abstract":"Given their autonomy, flexibility and large range of functionality, wireless sensor networks can be used as an effective and discreet means for monitoring data in many domains. Typical sensor nodes are very constrained, in particular regarding their energy and memory resources. Thus, any query processing solution over these devices should consider their limitations. We investigate the problem of processing join queries within a sensor network. Due to the limited memory at nodes, joins are typically processed in a distributed manner over a set of nodes. Previous approaches have either assumed that the join processing nodes have sufficient memory to buffer the subset of the join relations assigned to them, or that the amount of available memory at nodes is known in advance. These assumptions are not realistic for most scenarios. In this context we propose and investigate DIJ, a distributed algorithm for join processing that considers the memory limitations at nodes and does not make a priori assumptions on the available memory at the processing nodes. At the same time, our algorithm still aims at minimizing the energy cost of query processing.","PeriodicalId":122925,"journal":{"name":"19th International Conference on Scientific and Statistical Database Management (SSDBM 2007)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122570212","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Reliable Hierarchical Data Storage in Sensor Networks","authors":"Song Lin, Benjamin Arai, D. Gunopulos","doi":"10.1109/SSDBM.2007.39","DOIUrl":"https://doi.org/10.1109/SSDBM.2007.39","url":null,"abstract":"The ability to provide reliable in-network storage while balancing the energy consumption of individual sensors is a primary concern when deploying a sensor network. The main concern with data-centric storage in sensor networks is the ability to provide reliable and load-balanced storage. Energy and wireless range constraints make centralized approaches for storage impractical, and in-network data-centric solutions can be used to reduce the number of messages sent over the network. However, these solutions quickly become expensive when combined with fault tolerance, load balancing and routing. In this paper, we present a novel data-centric storage and query routing mechanism for sensor networks. The routing mechanism is constructed upon the neighborhood information of individual sensors and is completely independent of geographical information. Our data-resilient algorithm is capable of recovering from multiple simultaneous failures in the network while adaptively adjusting the load distribution of the newly generated sensor data. Comprehensive experiments on both real-world and synthetic data sets indicate that our approach is more effective and efficient than the previously proposed solutions.","PeriodicalId":122925,"journal":{"name":"19th International Conference on Scientific and Statistical Database Management (SSDBM 2007)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121207853","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Information-Aware 2^n-Tree for Efficient Out-of-Core Indexing of Very Large Multidimensional Volumetric Data","authors":"Jusub Kim, J. JáJá","doi":"10.1109/SSDBM.2007.15","DOIUrl":"https://doi.org/10.1109/SSDBM.2007.15","url":null,"abstract":"We discuss a new efficient out-of-core multidimensional indexing structure, the information-aware 2^n-tree, for indexing very large multidimensional volumetric data. Building a series of (n-1)-dimensional indexing structures on n-dimensional data causes a scalability problem in the situation of continually growing resolution in every dimension. However, building a single n-dimensional indexing structure can cause an indexing effectiveness problem compared to the former case. The information-aware 2^n-tree is an effort to maximize the indexing structure efficiency by ensuring that the subdivision of space has as similar coherence as possible along each dimension. It is particularly useful when the data distribution along each dimension constantly shows a different degree of coherence from every other dimension. Our preliminary results show that our new tree can achieve higher indexing structure efficiency than previous methods.","PeriodicalId":122925,"journal":{"name":"19th International Conference on Scientific and Statistical Database Management (SSDBM 2007)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124519387","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MAMCost: Global and Local Estimates leading to Robust Cost Estimation of Similarity Queries","authors":"Gisele Busichia Baioco, A. Traina, C. Traina","doi":"10.1109/SSDBM.2007.17","DOIUrl":"https://doi.org/10.1109/SSDBM.2007.17","url":null,"abstract":"This paper presents an effective cost model to estimate the number of disk accesses (I/O cost) and the number of distance calculations (CPU cost) to process similarity queries over data indexed by metric access methods. Two types of similarity queries were taken into consideration: range and k-nearest neighbor queries. The main point of the cost model is considering not only global parameters of the data set but also the local data distribution. The model takes advantage of the intrinsic dimension of the data set, estimated by its correlation fractal dimension. Experiments were performed on real and synthetic data sets, with different sizes and dimensions, in order to validate the proposed model. They confirmed that the estimations are accurate, within the range achieved by real queries.","PeriodicalId":122925,"journal":{"name":"19th International Conference on Scientific and Statistical Database Management (SSDBM 2007)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121907218","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cost-based Optimization of Complex Scientific Queries","authors":"R. Fomkin, T. Risch","doi":"10.1109/SSDBM.2007.8","DOIUrl":"https://doi.org/10.1109/SSDBM.2007.8","url":null,"abstract":"High energy physics scientists analyze large amounts of data looking for interesting events when particles collide. These analyses are easily expressed using complex queries that filter events. We developed a cost model for aggregation operators and other functions used in such queries and show that it substantially improves performance. However, the query optimizer still produces suboptimal plans because of estimate errors. Furthermore, the optimization is very slow because of the large query size. We improved the optimization by a profiled grouping strategy where the scientific query is first automatically fragmented into subqueries based on application knowledge. Each fragment is then independently profiled on a sample of events to measure real execution cost and cardinality. An optimized fragmented query is shown to execute faster than a query optimized with the cost model alone. Furthermore, the total optimization time, including fragmentation and profiling, is substantially improved.","PeriodicalId":122925,"journal":{"name":"19th International Conference on Scientific and Statistical Database Management (SSDBM 2007)","volume":"84 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129243602","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Update Conscious Bitmap Indices","authors":"G. Canahuate, Michael Gibas, H. Ferhatosmanoğlu","doi":"10.1109/SSDBM.2007.24","DOIUrl":"https://doi.org/10.1109/SSDBM.2007.24","url":null,"abstract":"Bitmap indices have been widely used in several domains such as data warehousing and scientific applications due to their efficiency in answering certain query types over large data sets. However, their utilization has been largely limited to read-only data sets or to static snapshots of data due to the cost associated with the update and append of new data. Typically, several bitmaps are associated with each indexed attribute in a table, i.e. one for each attribute value, bin, or range. Each one of these bitmaps needs to be updated to reflect a new, appended row. Since a given table could be represented by hundreds or even thousands of bitmaps, the insertion of a single record can be prohibitively costly. In order to transfer the fast query response times offered by bitmap indices to dynamic database domains, we propose an update conscious bitmap index that provides a mechanism to quickly update bitmaps to reflect dynamic database changes. For an insert operation only the bitmaps that represent the values being inserted need to be updated. We formalize the insert and delete operations of the proposed technique and provide a cost model for bitmap updates. We compare the update conscious bitmaps to traditional bitmaps in terms of storage space, update performance, and query execution time.","PeriodicalId":122925,"journal":{"name":"19th International Conference on Scientific and Statistical Database Management (SSDBM 2007)","volume":"147 3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128836061","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Gene Ontology-Based Annotation Analysis and Categorization of Metabolic Pathways","authors":"A. Cakmak","doi":"10.1109/SSDBM.2007.35","DOIUrl":"https://doi.org/10.1109/SSDBM.2007.35","url":null,"abstract":"Functional characterizations of pathways provide new opportunities in defining, understanding, and comparing existing biological pathways, and in helping discover new ones in different organisms. In this paper, we present and evaluate computational techniques for categorizing pathways, based upon the Gene Ontology (GO) annotations of enzymes within metabolic pathways. Our approach is to use the notion of functionality templates, GO-functional graphs of pathways. Pathway categorization is then achieved through learning models built on different characteristics of functionality templates. We have experimentally evaluated the accuracy of automated pathway categorization with respect to different learning models and their parameters. Using KEGG metabolic pathways, the pathway categorization tool reaches 90% or higher accuracy.","PeriodicalId":122925,"journal":{"name":"19th International Conference on Scientific and Statistical Database Management (SSDBM 2007)","volume":"132 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131555911","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Mining RNA Tertiary Motifs with Structure Graphs","authors":"Xueyi Wang, Jun Huan, J. Snoeyink, Wei Wang","doi":"10.1109/SSDBM.2007.38","DOIUrl":"https://doi.org/10.1109/SSDBM.2007.38","url":null,"abstract":"We present a novel application of graph database mining to identify tertiary motifs in RNA structures. In our method, we abstract an RNA molecule as a labeled graph and use a frequent subgraph mining technique to derive tertiary motifs. By applying our technique to ribosome RNA and transfer RNA, we have identified known RNA tertiary motifs such as the ribose zipper and U-turn, plus candidates for novel tertiary motifs. Finally, we suggest an iterative multiple structure alignment algorithm to classify tertiary motifs and generate consensus motifs.","PeriodicalId":122925,"journal":{"name":"19th International Conference on Scientific and Statistical Database Management (SSDBM 2007)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134645352","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Maintaining K-Anonymity against Incremental Updates","authors":"J. Pei, Jian Xu, Zhibin Wang, Wei Wang, Ke Wang","doi":"10.1109/SSDBM.2007.16","DOIUrl":"https://doi.org/10.1109/SSDBM.2007.16","url":null,"abstract":"K-anonymity is a simple yet practical mechanism to protect privacy against attacks of re-identifying individuals by joining multiple public data sources. All existing methods achieving k-anonymity assume implicitly that the data objects to be anonymized are given once and fixed. However, in many applications, the real-world data sources are dynamic. In this paper, we investigate the problem of maintaining k-anonymity against incremental updates, and propose a simple yet effective solution. We analyze how inferences from multiple releases may temper the k-anonymity of data, and propose the monotonic incremental anonymization property. The general idea is to progressively and consistently reduce the generalization granularity as incremental updates arrive. Our new approach guarantees k-anonymity on each release, and also on the table inferred from multiple releases. At the same time, our new approach uses the steadily accumulating data to reduce the information loss.","PeriodicalId":122925,"journal":{"name":"19th International Conference on Scientific and Statistical Database Management (SSDBM 2007)","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124955328","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}