{"title":"Self-deadlocks in disparate scientific data management systems","authors":"F. Pentaris, Y. Ioannidis","doi":"10.1109/SSDBM.2004.63","DOIUrl":"https://doi.org/10.1109/SSDBM.2004.63","url":null,"abstract":"In large statistical and scientific data management environments, where mediation architectures are used to integrate disparate and autonomous systems, a new problem - self-deadlock - may cause global transaction failures. In this short paper we briefly examine the reasons causing this problem and identify some algorithms for resolving it.","PeriodicalId":383615,"journal":{"name":"Proceedings. 16th International Conference on Scientific and Statistical Database Management, 2004.","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114882336","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient query processing on relational data-partitioning index structures","authors":"H. Kriegel, Peter Kunath, M. Pfeifle, M. Renz","doi":"10.1109/SSDBM.2004.32","DOIUrl":"https://doi.org/10.1109/SSDBM.2004.32","url":null,"abstract":"In contrast to space-partitioning index structures, data-partitioning index structures naturally adapt to the actual data distribution which results in a very good query response behavior. Besides efficient query processing, modern database applications including computer-aided design, medical imaging, or molecular biology require fully-fledged database management systems in order to guarantee industrial-strength. In this paper, we show how we can achieve efficient query processing on data-partitioning index structures within general purpose database systems. We reduce the navigational index traversal cost by using \"extended index range scans\". If a directory node is \"largely\" covered by the actual query, the recursive tree traversal for this node can beneficially be replaced by a scan on the leaf level of the index instead of navigating through the directory any longer. On the other hand, for highly selective queries, the index is used as usual. In this paper, we demonstrate the benefits of this idea for spatial collision queries on the relational R-tree. Our experiments with an Oracle9i database system show that our new approach outperforms common index structures and the sequential scan considerably.","PeriodicalId":383615,"journal":{"name":"Proceedings. 16th International Conference on Scientific and Statistical Database Management, 2004.","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122789036","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A weight-based map matching method in moving objects databases","authors":"Huabei Yin, O. Wolfson","doi":"10.1109/SSDBM.2004.10","DOIUrl":"https://doi.org/10.1109/SSDBM.2004.10","url":null,"abstract":"In location management, the trajectory represents the motion of a moving object in 3D space-time, i.e., a sequence (x, y, t). Unfortunately, location technologies, cannot guarantee error-freedom. Thus, map matching (a.k.a. snapping), matching a trajectory to the roads on the map, is necessary. We introduce a weight-based map matching method, and experimentally show that, for the offline situation, on average, our algorithm can get up to 94% correctness depending on the GPS sampling interval.","PeriodicalId":383615,"journal":{"name":"Proceedings. 16th International Conference on Scientific and Statistical Database Management, 2004.","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125542266","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
E. Kapetanios, David Baer, Björn Glaus, Paul Groenewoud
{"title":"MDDQL-Stat: data querying and analysis through integration of intentional and extensional semantics","authors":"E. Kapetanios, David Baer, Björn Glaus, Paul Groenewoud","doi":"10.1109/SSDBM.2004.49","DOIUrl":"https://doi.org/10.1109/SSDBM.2004.49","url":null,"abstract":"We would like to present a prototype system enabling a rather empirical than a formal approach to the problem of posing queries to a semantically rich (quality aspects, semantic distance, etc.) data integration system {G,S,M} (Global schema, Sources, Mediation) through integration not only of intensional but also of extensional semantics. While the first is provided by an alphabet A as given by an ontology based global schema C, and a high level query language (conjunction/disjunction + inequalities + statistical operations), the latter enables synthesizing of data source specific and previously transformed query results according to well-defined set operations for heterogeneous, distributed data sources. Our approach contrasts with other GAV (Global-As-View) related architectures for mediation of integrated read-only views, in that it simplifies query processing while preserving flexibility when adding new data sources, despite the inherited complexity of mappings due to enhanced semantic description of data (semantic distance, quality parameters, etc.) such that statistical results and comparisons become more meaningful.","PeriodicalId":383615,"journal":{"name":"Proceedings. 16th International Conference on Scientific and Statistical Database Management, 2004.","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127332349","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"AutoPart: automating schema design for large scientific databases using data partitioning","authors":"Stratos Papadomanolakis, A. Ailamaki","doi":"10.1109/SSDBM.2004.19","DOIUrl":"https://doi.org/10.1109/SSDBM.2004.19","url":null,"abstract":"Database applications that use multi-terabyte datasets are becoming increasingly important for scientific fields such as astronomy and biology. Scientific databases are particularly suited for the application of automated physical design techniques, because of their data volume and the complexity of the scientific workloads. Current automated physical design tools focus on the selection of indexes and materialized views. In large-scale scientific databases, however the data volume and the continuous insertion of new data allows for only limited indexes and materialized views. By contrast, data partitioning does not replicate data, thereby reducing space requirements and minimizing update overhead. In this paper we present AutoPart, an algorithm that automatically partitions database tables to optimize sequential access assuming prior knowledge of a representative workload. The resulting schema is indexed using a fraction of the space required for indexing the original schema. To evaluate AutoPart we built an automated schema design tool that interfaces to commercial database systems. We experiment with AutoPart in the context of the Sloan Digital Sky Survey database, a real-world astronomical database, running on SQL Server 2000. Our experiments demonstrate the benefits of partitioning for large-scale systems: partitioning alone improves query execution performance by a factor of two on average. Combined with indexes, the new schema also outperforms the indexed original schema by 20% (for queries) and a factor of five (for updates), while using only half the original index space.","PeriodicalId":383615,"journal":{"name":"Proceedings. 16th International Conference on Scientific and Statistical Database Management, 2004.","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125102143","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A scalable approach to approximating aggregate queries over intermittent streams","authors":"Shanzhong Zhu, C. Ravishankar","doi":"10.1109/SSDBM.2004.6","DOIUrl":"https://doi.org/10.1109/SSDBM.2004.6","url":null,"abstract":"We present a novel approach to approximate evaluation of standing aggregate queries over streaming data, subject to user-specified error bounds. Our method models the behavior of aggregates as Brownian motions, and adoptively updates the model according to stream characteristics. This approach has two advantages. First, it greatly improves system scalability since we can defer query evaluation as long as the difference between the returned and true aggregate values remains within user-specified bounds. Second, we are able to provide approximate answers during stream interruptions by estimating the rate at which the streams and the aggregate drift during the blackout periods. We also study processor allocation issues in such approximate aggregate evaluation systems. Our experiments show that our model captures the behavior of real-world streams such as sensor data and stock traces with excellent fidelity, and scales very well for large numbers of standing queries.","PeriodicalId":383615,"journal":{"name":"Proceedings. 16th International Conference on Scientific and Statistical Database Management, 2004.","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125762317","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A fast algorithm for subspace clustering by pattern similarity","authors":"Haixun Wang, F. Chu, W. Fan, Philip S. Yu, J. Pei","doi":"10.1109/SSDBM.2004.3","DOIUrl":"https://doi.org/10.1109/SSDBM.2004.3","url":null,"abstract":"Unlike traditional clustering methods that focus on grouping objects with similar values on a set of dimensions, clustering by pattern similarity finds objects that exhibit a coherent pattern of rise and fall in subspaces. Pattern-based clustering extends the concept of traditional clustering and benefits a wide range of applications, including large scale scientific data analysis, target marketing, Web usage analysis, etc. However, state-of-the-art pattern-based clustering methods (e.g., the pCluster algorithm) can only handle data sets of thousands of records, which makes them inappropriate for many real-life applications. Furthermore, besides the huge data volume, many data sets are also characterized by their sequentiality, for instance, customer purchase records and network event logs are usually modeled as data sequences. Hence, it becomes important to enable pattern-based clustering methods i) to handle large datasets, and ii) to discover pattern similarity embedded in data sequences. In this paper, we present a novel algorithm that offers this capability. Experimental results from both real life and synthetic datasets prove its effectiveness and efficiency.","PeriodicalId":383615,"journal":{"name":"Proceedings. 16th International Conference on Scientific and Statistical Database Management, 2004.","volume":"139 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121960808","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SaIL: a library for efficient application integration of spatial indices","authors":"Marios Hadjieleftheriou, E. Hoel, V. Tsotras","doi":"10.1109/SSDBM.2004.60","DOIUrl":"https://doi.org/10.1109/SSDBM.2004.60","url":null,"abstract":"Many scientific applications deal with spatial, spatiotemporal and other multidimensional indexing structures, typically managing millions of objects with arbitrary and complex features. Choosing the appropriate method to index such data becomes rather difficult. Having an index library that can combine different indices under the same programming interface is thus very valuable. In this paper we present SalL (SpAtial Index Library), a robust and extensible library that enables simple integration of spatial index structures in existing applications. We mainly focus on design issues and elaborate on techniques for making the framework generic enough, so that it can support user defined data types, customizable spatial queries, and a broad range of spatial (and spatiotemporal) index structures. The library is publicly available and has already been successfully utilized for research and commercial applications.","PeriodicalId":383615,"journal":{"name":"Proceedings. 16th International Conference on Scientific and Statistical Database Management, 2004.","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114356075","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HybridTreeMiner: an efficient algorithm for mining frequent rooted trees and free trees using canonical forms","authors":"Yun Chi, Yirong Yang, R. Muntz","doi":"10.1109/SSDBM.2004.41","DOIUrl":"https://doi.org/10.1109/SSDBM.2004.41","url":null,"abstract":"Tree structures are used extensively in domains such as computational biology, pattern recognition, XML databases, computer networks, and so on. In this paper, we present HybridTreeMiner, a computationally efficient algorithm that discovers all frequently occurring subtrees in a database of rooted unordered trees. The algorithm mines frequent subtrees by traversing an enumeration tree that systematically enumerates all subtrees. The enumeration tree is defined based on a novel canonical form for rooted unordered trees - the breadth-first canonical form (BFCF). By extending the definitions of our canonical form and enumeration tree to free trees, our algorithm can efficiently handle databases of free trees as well. We study the performance of our algorithms through extensive experiments based on both synthetic data and datasets from real applications. The experiments show that our algorithm is competitive in comparison to known rooted tree mining algorithms and is faster by one to two orders of magnitudes compared to a known algorithm for mining frequent free trees.","PeriodicalId":383615,"journal":{"name":"Proceedings. 16th International Conference on Scientific and Statistical Database Management, 2004.","volume":"2015 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132929912","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A shrinking-based dimension reduction approach for multi-dimensional analysis","authors":"Yong Shi, A. Zhang","doi":"10.1109/SSDBM.2004.8","DOIUrl":"https://doi.org/10.1109/SSDBM.2004.8","url":null,"abstract":"In this paper, we present continuous research on data analysis based on our previous work on the shrinking approach. Shrinking is a novel data preprocessing technique which optimizes the inner structure of data inspired by the Newton's Universal Law of Gravitation in the real world. It can be applied in many data mining fields. Following our previous work on the shrinking method for multidimensional data analysis in full data space, we propose a shrinking-based dimension reduction approach which tends to solve the dimension reduction problem from a new perspective. In this approach data are moved along the direction of the density gradient, thus making the inner structure of data more prominent. It is conducted on a sequence of grids with different cell sizes. Dimension reduction process is performed based on the difference of the data distribution projected on each dimension before and after the data-shrinking process. Those dimensions with dramatic variation of data distribution through the data-shrinking process are selected as good dimension candidates for further data analysis. This approach can assist to improve the performance of existing data analysis approaches. We demonstrate how this shrinking-based dimension reduction approach affects the clustering results of well known algorithms.","PeriodicalId":383615,"journal":{"name":"Proceedings. 16th International Conference on Scientific and Statistical Database Management, 2004.","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124936116","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}