{"title":"Online mining of data streams: applications, techniques and progress","authors":"Haixun Wang, J. Pei, Philip S. Yu","doi":"10.1109/ICDE.2005.101","DOIUrl":"https://doi.org/10.1109/ICDE.2005.101","url":null,"abstract":"In this paper, we focus on the differences between mining static large data sets and data streams. Over the years, the database and data mining community have learned valuable lessons from mining static large data sets, and developed many useful algorithms and tools for this purpose. The paper aims at providing a shortcut to the current frontier of stream mining research. We emphasize the research problems, the inherent technical challenges and the latest results. Particularly, the paper highlights new challenges and potential research interests. Research community has been interested in the integration between data mining tasks and database management systems.","PeriodicalId":297231,"journal":{"name":"21st International Conference on Data Engineering (ICDE'05)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133290733","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"THALIA: Test Harness for the Assessment of Legacy Information Integration Approaches","authors":"J. Hammer, M. Stonebraker, Oguzhan Topsakal","doi":"10.1109/ICDE.2005.140","DOIUrl":"https://doi.org/10.1109/ICDE.2005.140","url":null,"abstract":"We introduce our new, publicly available testbed and benchmark called THALIA (Test Harness for the Assessment of Legacy information Integration Approaches) for testing and evaluating integration technologies. THALIA provides researchers with a collection of 40 downloadable data sources representing University course catalogs from computer science departments worldwide. In addition, THALIA currently provides a set of twelve challenge queries as well as a scoring function for ranking the performance of an integration system. A second contribution is a systematic classification of the types of syntactic and semantic heterogeneities, which directly lead to the twelve challenge. We have chosen course information as our domain of discourse because it is well known and easy to understand. Furthermore, there is an abundance of data sources publicly available that allowed us to develop a testbed exhibiting all of the syntactic and semantic heterogeneities that we have identified.","PeriodicalId":297231,"journal":{"name":"21st International Conference on Data Engineering (ICDE'05)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122406733","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A multiresolution symbolic representation of time series","authors":"V. Megalooikonomou, Qiang Wang, Guo Li, C. Faloutsos","doi":"10.1109/ICDE.2005.10","DOIUrl":"https://doi.org/10.1109/ICDE.2005.10","url":null,"abstract":"Efficiently and accurately searching for similarities among time series and discovering interesting patterns is an important and non-trivial problem. In this paper, we introduce a new representation of time series, the multiresolution vector quantized (MVQ) approximation, along with a new distance function. The novelty of MVQ is that it keeps both local and global information about the original time series in a hierarchical mechanism, processing the original time series at multiple resolutions. Moreover, the proposed representation is symbolic employing key subsequences and potentially allows the application of text-based retrieval techniques into the similarity analysis of time series. The proposed method is fast and scales linearly with the size of database and the dimensionality. Contrary to the vast majority in the literature that uses the Euclidean distance, MVQ uses a multi-resolution/hierarchical distance function. We performed experiments with real and synthetic data. The proposed distance function consistently outperforms all the major competitors (Euclidean, dynamic time warping, piecewise aggregate approximation) achieving up to 20% better precision/recall and clustering accuracy on the tested datasets.","PeriodicalId":297231,"journal":{"name":"21st International Conference on Data Engineering (ICDE'05)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122833431","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Configurable security protocols for multi-party data analysis with malicious participants","authors":"B. Malin, E. Airoldi, Samuel Edoho-Eket, Yiheng Li","doi":"10.1109/ICDE.2005.37","DOIUrl":"https://doi.org/10.1109/ICDE.2005.37","url":null,"abstract":"Standard multi-party computation models assume semi-honest behavior, where the majority of participants implement protocols according to specification, an assumption not always plausible. In this paper we introduce a multi-party protocol for collaborative data analysis when participants are malicious and fail to follow specification. The protocol incorporates a semi-trusted third party, which analyzes encrypted data and provides honest responses that only intended recipients can successfully decrypt. The protocol incorporates data confidentiality by enabling participants to receive encrypted responses tailored to their own encrypted data submissions without revealing plaintext to other participants, including the third party. As opposed to previous models, trust need only be placed on a single participant with no data at stake. Additionally, the proposed protocol is configurable in a way that security features are controlled by independent subprotocols. Various combinations of subprotocols allow for a flexible security system, appropriate for a number of distributed data applications, such as secure list comparison.","PeriodicalId":297231,"journal":{"name":"21st International Conference on Data Engineering (ICDE'05)","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124986045","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Adaptive caching for continuous queries","authors":"S. Babu, Kamesh Munagala, J. Widom, R. Motwani","doi":"10.1109/ICDE.2005.15","DOIUrl":"https://doi.org/10.1109/ICDE.2005.15","url":null,"abstract":"We address the problem of executing continuous multiway join queries in unpredictable and volatile environments. Our query class captures windowed join queries in data stream systems as well as conventional maintenance of materialized join views. Our adaptive approach handles streams of updates whose rates and data characteristics may change over time, as well as changes in system conditions such as memory availability. In this paper we focus specifically on the problem of adaptive placement and removal of caches to optimize join performance. Our approach automatically considers conventional tree-shaped join plans with materialized subresults at every intermediate node, subresult-free MJoins, and the entire spectrum between them. We provide algorithms for selecting caches, monitoring their cost and benefits in current conditions, allocating memory to caches, and adapting as conditions change. All of our algorithms are implemented in the STREAM prototype data stream management system and a thorough experimental evaluation is included.","PeriodicalId":297231,"journal":{"name":"21st International Conference on Data Engineering (ICDE'05)","volume":"71 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125017499","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On discovery of extremely low-dimensional clusters using semi-supervised projected clustering","authors":"Kevin Y. Yip, D. Cheung, M. Ng","doi":"10.1109/ICDE.2005.96","DOIUrl":"https://doi.org/10.1109/ICDE.2005.96","url":null,"abstract":"Recent studies suggest that projected clusters with extremely low dimensionality exist in many real datasets. A number of projected clustering algorithms have been proposed in the past several years, but few can identify clusters with dimensionality lower than 10% of the total number of dimensions, which are commonly found in some real datasets such as gene expression profiles. In this paper we propose a new algorithm that can accurately identify projected clusters with relevant dimensions as few as 5% of the total number of dimensions. It makes use of a robust objective function that combines object clustering and dimension selection into a single optimization problem. The algorithm can also utilize domain knowledge in the form of labeled objects and labeled dimensions to improve its clustering accuracy. We believe this is the first semi-supervised projected clustering algorithm. Both theoretical analysis and experimental results show that by using a small amount of input knowledge, possibly covering only a portion of the underlying classes, the new algorithm can be further improved to accurately detect clusters with only 1% of the dimensions being relevant. The algorithm is also useful in getting a target set of clusters when there are multiple possible groupings of the objects.","PeriodicalId":297231,"journal":{"name":"21st International Conference on Data Engineering (ICDE'05)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128345870","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A relationally complete visual query language for heterogeneous data sources and pervasive querying","authors":"S. Polyviou, G. Samaras, P. Evripidou","doi":"10.1109/ICDE.2005.12","DOIUrl":"https://doi.org/10.1109/ICDE.2005.12","url":null,"abstract":"In this paper we introduce and formally define Query by Browsing (QBB), a scalable, relationally complete visual query language based on the desktop user interface paradigm and tuple relational calculus that allows the formulation of complex queries over relational, entity-relationship, object-oriented and XML data sources on a variety of handheld and desktop platforms. It is to our knowledge the first visual query language to combine the important characteristics of usability, scalability, expressive power and flexibility. We support these claims by demonstrating the similarity of the QBB paradigm to the popular desktop user interface paradigm, by relating it to relational calculus and relational algebra and by describing Chiromancer II, a Web-based implementation of the QBB paradigm for handheld devices. We also discuss ways in which non-relational sources can be represented and queried and compare QBB to related work in the area of visual query languages for a variety of data models. We finally offer conclusions and thoughts for future work.","PeriodicalId":297231,"journal":{"name":"21st International Conference on Data Engineering (ICDE'05)","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124012739","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Finding (recently) frequent items in distributed data streams","authors":"A. Manjhi, Vladislav Shkapenyuk, Kedar Dhamdhere, Christopher Olston","doi":"10.1109/ICDE.2005.68","DOIUrl":"https://doi.org/10.1109/ICDE.2005.68","url":null,"abstract":"We consider the problem of maintaining frequency counts for items occurring frequently in the union of multiple distributed data streams. Naive methods of combining approximate frequency counts from multiple nodes tend to result in excessively large data structures that are costly to transfer among nodes. To minimize communication requirements, the degree of precision maintained by each node while counting item frequencies must be managed carefully. We introduce the concept of a precision gradient for managing precision when nodes are arranged in a hierarchical communication structure. We then study the optimization problem of how to set the precision gradient so as to minimize communication, and provide optimal solutions that minimize worst-case communication load over all possible inputs. We then introduce a variant designed to perform well in practice, with input data that does not conform to worst-case characteristics. We verify the effectiveness of our approach empirically using real-world data, and show that our methods incur substantially less communication than naive approaches while providing the same error guarantees on answers.","PeriodicalId":297231,"journal":{"name":"21st International Conference on Data Engineering (ICDE'05)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126933491","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards building a MetaQuerier: extracting and matching Web query interfaces","authors":"Bin He, Zhen Zhang, K. Chang","doi":"10.1109/ICDE.2005.145","DOIUrl":"https://doi.org/10.1109/ICDE.2005.145","url":null,"abstract":"We witness the rapid growth and thus the prevalence of databases on the Web. Our recent study in April 2004 estimated 450,000 online databases. On this deep Web, myriad databases provide dynamic query-based data access through their query interfaces, instead of static URL links. It is thus essential to integrate these query interfaces for integrating the deep Web. The overall goal of the MetaQuerier project aims at opening up the deep Web to users, by building a system to help users exploring and integrating deep Web sources. In particular, to start with, we focus on the integration of deep Web sources in the same domain, which is itself an important integration task. To automate this integration scenario, we need to solve two critical problems: extracting query interfaces and matching query interfaces. To solve the interface extraction problem, we introduce a parsing paradigm by hypothesizing the existence of hidden syntax which describes the layout and semantic of Web interfaces. Also, unlike traditional pairwise schema matching, we propose a holistic matching approach, which matches all schemas at the same time with the hypothesis of a hidden schema model. Therefore, our techniques explore, in essence, \"data mining for information integration.\" That is, we mine the observable information to discover the underlying semantics.","PeriodicalId":297231,"journal":{"name":"21st International Conference on Data Engineering (ICDE'05)","volume":"303 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133489985","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"AutoLag: automatic discovery of lag correlations in stream data","authors":"Yasushi Sakurai, S. Papadimitriou, C. Faloutsos","doi":"10.1109/ICDE.2005.24","DOIUrl":"https://doi.org/10.1109/ICDE.2005.24","url":null,"abstract":"We have introduced the problem of automatic lag correlation detection on streaming data and proposed AutoLag to address this problem by using careful approximations and smoothing. Our experiments on real and realistic data show that AutoLag works as expected, estimating the unknown lags with excellent accuracy and significant speed-up. In our experiments on real and realistic data, AutoLag was up to about 42,000 times faster than the naive implementation, with at most 1% relative error.","PeriodicalId":297231,"journal":{"name":"21st International Conference on Data Engineering (ICDE'05)","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130009253","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}