P. Pietzuch, J. Ledlie, Jeffrey Shneidman, M. Roussopoulos, M. Welsh, M. Seltzer
{"title":"Network-Aware Operator Placement for Stream-Processing Systems","authors":"P. Pietzuch, J. Ledlie, Jeffrey Shneidman, M. Roussopoulos, M. Welsh, M. Seltzer","doi":"10.1109/ICDE.2006.105","DOIUrl":"https://doi.org/10.1109/ICDE.2006.105","url":null,"abstract":"To use their pool of resources efficiently, distributed stream-processing systems push query operators to nodes within the network. Currently, these operators, ranging from simple filters to custom business logic, are placed manually at intermediate nodes along the transmission path to meet application-specific performance goals. Determining placement locations is challenging because network and node conditions change over time and because streams may interact with each other, opening venues for reuse and repositioning of operators. This paper describes a stream-based overlay network (SBON), a layer between a stream-processing system and the physical network that manages operator placement for stream-processing systems. Our design is based on a cost space, an abstract representation of the network and on-going streams, which permits decentralized, large-scale multi-query optimization decisions. We present an evaluation of the SBON approach through simulation, experiments on PlanetLab, and an integration with Borealis, an existing stream-processing engine. Our results show that an SBON consistently improves network utilization, provides low stream latency, and enables dynamic optimization at low engineering cost.","PeriodicalId":6819,"journal":{"name":"22nd International Conference on Data Engineering (ICDE'06)","volume":"122 1","pages":"49-49"},"PeriodicalIF":0.0,"publicationDate":"2006-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83649866","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Mining Actionable Patterns by Role Models","authors":"Ke Wang, Yuelong Jiang, A. Tuzhilin","doi":"10.1109/ICDE.2006.96","DOIUrl":"https://doi.org/10.1109/ICDE.2006.96","url":null,"abstract":"Data mining promises to discover valid and potentially useful patterns in data. Often, discovered patterns are not useful to the user.\"Actionability\" addresses this problem in that a pattern is deemed actionable if the user can act upon it in her favor. We introduce the notion of \"action\" as a domain-independent way to model the domain knowledge. Given a data set about actionable features and an utility measure, a pattern is actionable if it summarizes a population that can be acted upon towards a more promising population observed with a higher utility. We present several pruning strategies taking into account the actionability requirement to reduce the search space, and algorithms for mining all actionable patterns as well as mining the top k actionable patterns. We evaluate the usefulness of patterns and the focus of search on a real-world application domain.","PeriodicalId":6819,"journal":{"name":"22nd International Conference on Data Engineering (ICDE'06)","volume":"2 1","pages":"16-16"},"PeriodicalIF":0.0,"publicationDate":"2006-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84582011","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Foundations of Automated Database Tuning","authors":"Surajit Chaudhuri, G. Weikum","doi":"10.1109/ICDE.2006.72","DOIUrl":"https://doi.org/10.1109/ICDE.2006.72","url":null,"abstract":"Our society is more dependent on information systems than ever before. However, managing the information systems infrastructure in a cost-effective manner is a growing challenge. The total cost of ownership (TCO) of information technology is increasingly dominated by people costs. In fact, mistakes in operations and administration of information systems are the single most reasons for system outage and unacceptable performance. For information systems to provide value to their customers, we must reduce the complexity associated with their deployment and usage.","PeriodicalId":6819,"journal":{"name":"22nd International Conference on Data Engineering (ICDE'06)","volume":"1 1","pages":"104-104"},"PeriodicalIF":0.0,"publicationDate":"2006-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91297915","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Wei Zhang, Clement T. Yu, N. Smalheiser, Vetle I. Torvik
{"title":"Segmentation of Publication Records of Authors from the Web","authors":"Wei Zhang, Clement T. Yu, N. Smalheiser, Vetle I. Torvik","doi":"10.1109/ICDE.2006.137","DOIUrl":"https://doi.org/10.1109/ICDE.2006.137","url":null,"abstract":"Publication records are often found in the authors’ personal home pages. If such a record is partitioned into a list of semantic fields of authors, title, date, etc., the unstructured texts can be converted into structured data, which can be used in other applications. In this paper, we present PEPURS, a publication record segmentation system. It adopts a novel \"Split and Merge\" strategy. A publication record is split into segments; multiple statistical classifiers compute their likelihoods of belonging to different fields; finally adjacent segments are merged if they belong to the same field. PEPURS introduces the punctuation marks and their neighboring texts as a new feature to distinguish different roles of the marks. PEPURS yields high accuracy scores in experiments.","PeriodicalId":6819,"journal":{"name":"22nd International Conference on Data Engineering (ICDE'06)","volume":"5 1","pages":"120-120"},"PeriodicalIF":0.0,"publicationDate":"2006-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90727799","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Every Click You Make, IWill Be Fetching It: Efficient XML Query Processing in RDMS Using GUI-driven Prefetching","authors":"S. Bhowmick, Sandeep Prakash","doi":"10.1109/ICDE.2006.64","DOIUrl":"https://doi.org/10.1109/ICDE.2006.64","url":null,"abstract":"formulation and efficient processing of the formulated query. However, due to the nature of XML data, formulating an XML query using an XML query language such as XQuery requires considerable effort. A user must be completely familiar with the syntax of the query language, and must be able to express his/her needs accurately in a syntactically correct form. In many real life applications it is not realistic to assume that users are proficient in expressing such textual queries. Hence, there is a need for a user-friendly visual querying schemes to replace data retrieval aspects of XQuery. In this paper, we address the problem of efficient processing of XQueries in the relational environment where the queries are formulated using a user-friendly GUI. We take a novel and non-traditional approach to improving query performance by prefetching data during the formulation of a query in a single-user environment. The latency offered by the GUI-based query formulation is utilized to prefetch portions of the query results. The basic idea we employ for prefetching is that we prefetch constituent path expressions, store the intermediary results, reuse them when connective is added or \"Run\" is pressed.","PeriodicalId":6819,"journal":{"name":"22nd International Conference on Data Engineering (ICDE'06)","volume":"45 1","pages":"152-152"},"PeriodicalIF":0.0,"publicationDate":"2006-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85086503","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"What’s Different: Distributed, Continuous Monitoring of Duplicate-Resilient Aggregates on Data Streams","authors":"Graham Cormode, S. Muthukrishnan, W. Zhuang","doi":"10.1109/ICDE.2006.173","DOIUrl":"https://doi.org/10.1109/ICDE.2006.173","url":null,"abstract":"Emerging applications in sensor systems and network-wide IP traffic analysis present many technical challenges. They need distributed monitoring and continuous tracking of events. They have severe resource constraints not only at each site in terms of per-update processing time and archival space for highspeed streams of observations, but also crucially, communication constraints for collaborating on the monitoring task. These elements have been addressed in a series of recent works. A fundamental issue that arises is that one cannot make the \"uniqueness\" assumption on observed events which is present in previous works, since widescale monitoring invariably encounters the same events at different points. For example, within the network of an Internet Service Provider packets of the same flow will be observed in different routers; similarly, the same individual will be observed by multiple mobile sensors in monitoring wild animals. Aggregates of interest on such distributed environments must be resilient to duplicate observations. We study such duplicate-resilient aggregates that measure the extent of the duplication―how many unique observations are there, how many observations are unique―as well as standard holistic aggregates such as quantiles and heavy hitters over the unique items. We present accuracy guaranteed, highly communication-efficient algorithms for these aggregates that work within the time and space constraints of high speed streams. We also present results of a detailed experimental study on both real-life and synthetic data.","PeriodicalId":6819,"journal":{"name":"22nd International Conference on Data Engineering (ICDE'06)","volume":"7 1","pages":"57-57"},"PeriodicalIF":0.0,"publicationDate":"2006-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85632651","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Haixun Wang, Hao He, Jun Yang, Philip S. Yu, J. Yu
{"title":"Dual Labeling: Answering Graph Reachability Queries in Constant Time","authors":"Haixun Wang, Hao He, Jun Yang, Philip S. Yu, J. Yu","doi":"10.1109/ICDE.2006.53","DOIUrl":"https://doi.org/10.1109/ICDE.2006.53","url":null,"abstract":"Graph reachability is fundamental to a wide range of applications, including XML indexing, geographic navigation, Internet routing, ontology queries based on RDF/OWL, etc. Many applications involve huge graphs and require fast answering of reachability queries. Several reachability labeling methods have been proposed for this purpose. They assign labels to the vertices, such that the reachability between any two vertices may be decided using their labels only. For sparse graphs, 2-hop based reachability labeling schemes answer reachability queries efficiently using relatively small label space. However, the labeling process itself is often too time consuming to be practical for large graphs. In this paper, we propose a novel labeling scheme for sparse graphs. Our scheme ensures that graph reachability queries can be answered in constant time. Furthermore, for sparse graphs, the complexity of the labeling process is almost linear, which makes our algorithm applicable to massive datasets. Analytical and experimental results show that our approach is much more efficient than stateof- the-art approaches. Furthermore, our labeling method also provides an alternative scheme to tradeoff query time for label space, which further benefits applications that use tree-like graphs.","PeriodicalId":6819,"journal":{"name":"22nd International Conference on Data Engineering (ICDE'06)","volume":"26 1","pages":"75-75"},"PeriodicalIF":0.0,"publicationDate":"2006-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84739777","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SIPPER: Selecting Informative Peers in Structured P2P Environment for Content-Based Retrieval","authors":"Shuigeng Zhou, Zhengjie Zhang, Weining Qian, Aoying Zhou","doi":"10.1109/ICDE.2006.139","DOIUrl":"https://doi.org/10.1109/ICDE.2006.139","url":null,"abstract":"In this demonstration, we present a prototype system called SIPPER, which is the abbreviation for Selecting Informative Peers in Structured P2P Environment for Content-based Retrieval. SIPPER distinguishes itself from the existing P2P-IR systems by the following two features: First, to improve retrieval efficiency, SIPPER employs a novel peer selection method to direct the query to a small fraction of relevant peers in the network for searching globally relevant documents. Second, to reduce the bandwidth cost of meta data publishing, SIPPER uses a new publishing mechanism, the term-node publishing mechanism, which is different from the traditional term-document model [2].","PeriodicalId":6819,"journal":{"name":"22nd International Conference on Data Engineering (ICDE'06)","volume":"23 1","pages":"161-161"},"PeriodicalIF":0.0,"publicationDate":"2006-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85779219","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Xiaoming Jin, Xinqiang Zuo, K. Lam, Jianmin Wang, Jiaguang Sun
{"title":"Efficient Discovery of Emerging Frequent Patterns in ArbitraryWindows on Data Streams","authors":"Xiaoming Jin, Xinqiang Zuo, K. Lam, Jianmin Wang, Jiaguang Sun","doi":"10.1109/ICDE.2006.57","DOIUrl":"https://doi.org/10.1109/ICDE.2006.57","url":null,"abstract":"This paper proposes an effective data mining technique for finding useful patterns in streaming sequences. At present, typical approaches to this problem are to search for patterns in a fixed-size window sliding through the stream of data being collected. The practical values of such approaches are limited in that, in typical application scenarios, the patterns are emerging and it is difficult, if not impossible, to determine a priori a suitable window size within which useful patterns may exist. It is therefore desirable to devise techniques that can identify useful patterns with arbitrary window sizes. Attempts to this problem are challenging, however, because it requires a highly efficient searching in a substantially bigger solution space. This paper presents a new method which includes firstly a pruning strategy to reduce the search space and secondly a mining strategy that adopts a dynamic index structure to allow efficient discovery of emerging patterns in a streaming sequence. Experimental results on real data and synthetic data show that the proposed method outperforms other existing schemes both in computational efficiency and effectiveness in finding useful patterns.","PeriodicalId":6819,"journal":{"name":"22nd International Conference on Data Engineering (ICDE'06)","volume":"18 1","pages":"113-113"},"PeriodicalIF":0.0,"publicationDate":"2006-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87089894","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ganesh Ramakrishnan, Sachindra Joshi, Sumit Negi, R. Krishnapuram, S. Balakrishnan
{"title":"Automatic Sales Lead Generation from Web Data","authors":"Ganesh Ramakrishnan, Sachindra Joshi, Sumit Negi, R. Krishnapuram, S. Balakrishnan","doi":"10.1109/ICDE.2006.28","DOIUrl":"https://doi.org/10.1109/ICDE.2006.28","url":null,"abstract":"Speed to market is critical to companies that are driven by sales in a competitive market. The earlier a potential customer can be approached in the decision making process of a purchase, the higher are the chances of converting that prospect into a customer. Traditional methods to identify sales leads such as company surveys and direct marketing are manual, expensive and not scalable. Over the past decade the World Wide Web has grown into an information-mesh, with most important facts being reported through Web sites. Several news papers, press releases, trade journals, business magazines and other related sources are on-line. These sources could be used to identify prospective buyers automatically. In this paper, we present a system called ETAP (Electronic Trigger Alert Program) that extracts trigger events from Web data that help in identifying prospective buyers. Trigger events are events of corporate relevance and indicative of the propensity of companies to purchase new products associated with these events. Examples of trigger events are change in management, revenue growth and mergers & acquisitions. The unstructured nature of information makes the extraction task of trigger events difficult. We pose the problem of trigger events extraction as a classification problem and develop methods for learning trigger event classifiers using existing classification methods. We present methods to automatically generate the training data required to learn the classifiers. We also propose a method of feature abstraction that uses named entity recognition to solve the problem of data sparsity. We score and rank the trigger events extracted from ETAP for easy browsing. Our experiments show the effectiveness of the method and thus establish the feasibility of automatic sales lead generation using the Web data.","PeriodicalId":6819,"journal":{"name":"22nd International Conference on Data Engineering (ICDE'06)","volume":"67 1","pages":"101-101"},"PeriodicalIF":0.0,"publicationDate":"2006-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90276794","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}