{"title":"Upper bound on the length of generalized disjunction-free patterns","authors":"Marzena Kryszkiewicz","doi":"10.1109/SSDBM.2004.72","DOIUrl":"https://doi.org/10.1109/SSDBM.2004.72","url":null,"abstract":"A number of lossless representations of frequent patterns were proposed in recent years. The representation that consists of all frequent closed itemsets and the representations based on generalized disjunction-free patterns or on non-derivable itemsets are proven the most concise ones. Experiments show further that the latter ones are by a few orders of magnitude more concise (and determinable) than the former one. As follows from experiments, the representations based on generalized disjunction-free patterns are also more concise than the representations of frequent patterns available in the literature that determine supports of patterns in an approximate way. In this paper, we provide an upper bound on the length of generalized disjunction-free patterns. The bound determines the maximum number of scans of the database carried out by Apriori-like algorithms discovering the representations based on generalized disjunction-free patterns.","PeriodicalId":383615,"journal":{"name":"Proceedings. 16th International Conference on Scientific and Statistical Database Management, 2004.","volume":"125 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115188520","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fast computation of Iceberg Dwarf","authors":"Longgang Xiang, Feng Yucai","doi":"10.1109/SSDBM.2004.36","DOIUrl":"https://doi.org/10.1109/SSDBM.2004.36","url":null,"abstract":"Iceberg Dwarf (IceDwarf for short) combines the strength of Iceberg-Cube and Dwarf. It exploits the elegant Dwarf structure for cube tuple store and eliminates those unsatisfied sub-dwarfs. By only storing nontrivial cube tuples, IceDwarf reduces the size of a dwarf significantly, even though Dwarf itself already compresses the data cube effectively. We studied how to efficiently compute icedwarfs, and developed a straightforward algorithm (PAC). To further improve the performance, we explored the structure of Dwarf and presented four nice lemmas. Based on these observations, we proposed a new algorithm called PWC. It builds the IceDwarf by computing all the partitions of a fact table bottom-up and inserting them into the Dwarf structure, enabling Apriori-like pruning and single tuple partition optimization, and facilitating the detection of suffix redundancies. Our performance study showed that PWC is highly efficient and runs much faster than PAC for icedwarfs, even for computing full dwarfs.","PeriodicalId":383615,"journal":{"name":"Proceedings. 16th International Conference on Scientific and Statistical Database Management, 2004.","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128548779","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient similarity search in streaming time sequences","authors":"Maria Kontaki, A. Papadopoulos","doi":"10.1109/SSDBM.2004.33","DOIUrl":"https://doi.org/10.1109/SSDBM.2004.33","url":null,"abstract":"Query processing in data streams is a very important research direction. The challenge in a database of data streams is to provide efficient algorithms and access methods for query processing, taking into consideration the fact that the database changes continuously as new data arrive. Traditional access methods that continuously update the data are considered inefficient, due to the significant update costs. In this paper we present IDC-Index, an efficient technique for similarity query processing in streaming time sequences, which is based on a multidimensional access method enhanced with a deferred update policy and an incremental computation of the discrete Fourier transform (DFT), which is used as a feature extraction method. The method manages to reduce the number of false alarms examined and therefore achieves a high answers/candidates ratio. Moreover, an extensive performance evaluation based on synthetic random walk and real time sequences has shown that the proposed technique significantly outperforms existing approaches for similarity range query processing.","PeriodicalId":383615,"journal":{"name":"Proceedings. 16th International Conference on Scientific and Statistical Database Management, 2004.","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128571516","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parallelizing clustering of geoscientific data sets using data streams","authors":"Silvia Nittel, Kelvin T. Leung","doi":"10.1109/SSDBM.2004.58","DOIUrl":"https://doi.org/10.1109/SSDBM.2004.58","url":null,"abstract":"Computing data mining algorithms such as clustering on massive geospatial data sets is still neither feasible nor efficient today. In this paper, we introduce a k-means algorithm that is based on the data stream paradigm. The so-called partial/merge k-means algorithm is implemented as a set of data stream operators which are adaptable to available computing resources such as volatile memory and processing power. The partial data stream operator consumes as much data as can be fit into RAM, and performs a weighted k-means on the data subset. Subsequently, the weighted partial results are merged by a second data stream operator. All operators can be cloned and parallelized. In our analytical and experimental performance evaluation, we demonstrate that the partial/merge k-means can outperform a one-step algorithm by a large margin with regard to overall computation time and clustering quality with increasing data density per grid cell.","PeriodicalId":383615,"journal":{"name":"Proceedings. 16th International Conference on Scientific and Statistical Database Management, 2004.","volume":"85 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116305788","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Temporal range exploration of large scale multidimensional time series data","authors":"J. JáJá, Jusub Kim, Qin Wang","doi":"10.1109/SSDBM.2004.68","DOIUrl":"https://doi.org/10.1109/SSDBM.2004.68","url":null,"abstract":"We consider the problem of querying large scale multidimensional time series data to discover events of interest, test and validate hypotheses, or to associate temporal patterns with specific events. Large amounts of multidimensional time series data are currently available, and this type of data is growing at a fast rate due to the current trends in collecting time series of business, scientific, demographic, and simulation data. The ability to explore such collections interactively, even at a coarse level, will be critical in discovering the information and knowledge embedded in such collections. We develop indexing techniques and search algorithms to efficiently handle temporal range value querying of multidimensional time series data. Our indexing uses linear space data structures that enable the handling of queries very efficiently, invoking in the worst case a logarithmic number of queries to single time slices. We also show that our algorithm is ideally suited for parallel implementation on clusters of processors achieving a linear speedup in the number of available processors. A particularly simple data structure with provably good bounds is also presented for the case when the number of multidimensional objects is relatively small. These techniques improve significantly over previous techniques for either the serial or the parallel case, and are evaluated by extensive experimental results that confirm their superior performance.","PeriodicalId":383615,"journal":{"name":"Proceedings. 16th International Conference on Scientific and Statistical Database Management, 2004.","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124059201","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MISSION: an agent-based system for semantic integration of heterogeneous distributed statistical information sources","authors":"S. McClean, B. Scotney, Hans Rutjes, J. Hartkamp, Isambo Karali, M. Hatzopoulos, J. Lamb, Defeng Ma","doi":"10.1109/SSDBM.2004.52","DOIUrl":"https://doi.org/10.1109/SSDBM.2004.52","url":null,"abstract":"The MISSION system utilises query agents, in particular the matching and negotiation agents that are responsible for pre-integration where the matching agent decomposes the query into sub-queries, and then searches metadata to find datasets that match the query fragments. Such an approach provides a capability of automating the process of executing queries on heterogeneous statistical databases that are distributed over the Internet. The novelty lies in the provision of automated methods for statistical aggregation, where the heterogeneity essentially resides in the classification schemes of categorical data, including both heterogeneity of nomenclature and heterogeneity of granularity. In addition, our solution permits queries to be specified in a goal-driven query-by-example format. Rather than impose an a priori global standard, the user can query through a unified interface where integration is done at run-time.","PeriodicalId":383615,"journal":{"name":"Proceedings. 16th International Conference on Scientific and Statistical Database Management, 2004.","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125962891","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"All-nearest-neighbors queries in spatial databases","authors":"Jun Zhang, N. Mamoulis, D. Papadias, Yufei Tao","doi":"10.1109/SSDBM.2004.12","DOIUrl":"https://doi.org/10.1109/SSDBM.2004.12","url":null,"abstract":"Given two sets A and B of multidimensional objects, the all-nearest-neighbors (ANN) query retrieves for each object in A its nearest neighbor in B. Although this operation is common in several applications, it has not received much attention in the database literature. In this paper we study alternative methods for processing ANN queries depending on whether A and B are indexed. Our algorithms are evaluated through extensive experimentation using synthetic and real datasets. The performance studies show that they are an order of magnitude faster than a previous approach based on closest-pairs query processing.","PeriodicalId":383615,"journal":{"name":"Proceedings. 16th International Conference on Scientific and Statistical Database Management, 2004.","volume":"324 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132481430","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Monte Carlo sampling method for drawing representative samples from large databases","authors":"Hong Guo, W. Hou, Feng Yan, Qiang Zhu","doi":"10.1109/SSDBM.2004.5","DOIUrl":"https://doi.org/10.1109/SSDBM.2004.5","url":null,"abstract":"Sampling is important in areas like data mining, OLAP, selectivity estimation, clustering, etc. It has also become a necessity in social, economical, engineering, scientific, and statistical studies where databases are too large to handle. In this paper, a sampling method based on the Metropolis algorithm is proposed. Unlike the conventional uniform sampling methods, this method is able to select objects consistent with the underlying probability distribution. It is a simple, efficient, and powerful method suitable for all distributions. We have performed experiments to examine the qualities of the samples by comparing their statistical properties with the underlying population. The experimental results show that the samples selected by our method are bona fide representative.","PeriodicalId":383615,"journal":{"name":"Proceedings. 16th International Conference on Scientific and Statistical Database Management, 2004.","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125511009","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On integrating scientific resources through semantic registration","authors":"S. Bowers, K. Lin, Bertram Ludäscher","doi":"10.1109/SSDBM.2004.56","DOIUrl":"https://doi.org/10.1109/SSDBM.2004.56","url":null,"abstract":"In many data-centric scientific applications it is common to register datasets and computational services with a federation registry (also commonly called a catalog, directory, or repository). For example, the scientific data-handling system under development in the SEEK project must consider various dataset registries, including: MCAT, for access to SRB-registered datasets; Metacat, for KNB-registered datasets; DiGIR, for UDDI-registered data; and Xanthoria, an XML-based data registry. A challenge for SEEK, and similar efforts such as GEON, is to provide uniform access to registries and registered resources, based on emerging Web and grid standards.","PeriodicalId":383615,"journal":{"name":"Proceedings. 16th International Conference on Scientific and Statistical Database Management, 2004.","volume":"379 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116578488","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploiting multiple paths to express scientific queries","authors":"Z. Lacroix, Tiffany Morris, K. Parekh, L. Raschid, Maria-Esther Vidal","doi":"10.1109/SSDBM.2004.34","DOIUrl":"https://doi.org/10.1109/SSDBM.2004.34","url":null,"abstract":"The purpose of this demonstration is to present the main features of the BioNavigation system. Scientific data collection needed in various stages of scientific discovery is typically performed manually. For each scientific object of interest (e.g., a gene, a sequence), scientists query a succession of Web resources following links between retrieved entries. Each of the steps provides part of the intended characterization of the scientific object. This process is sometimes partially supported by hard-coded scripts or complex queries that will be evaluated by a mediation-based data integration system or against a data warehouse. These approaches fail in guiding the scientists during the collection process. In contrast, the BioNavigation approach presented in the paper provides the scientists with information on the available alternative resources, their provenance, and the costs of data collection. The BioNavigation system enhances a mediation-based integration system and provides scientists with support for the following: to ask queries at a high conceptual level; to visualize the multiple alternative resources that may be exploited to execute their data collection queries; to choose the final execution path to evaluate their queries.","PeriodicalId":383615,"journal":{"name":"Proceedings. 16th International Conference on Scientific and Statistical Database Management, 2004.","volume":"115 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116592367","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}