{"title":"Proactive identification of performance problems","authors":"S. Duan, S. Babu","doi":"10.1145/1142473.1142582","DOIUrl":"https://doi.org/10.1145/1142473.1142582","url":null,"abstract":"We propose to demonstrate Fa, an automated tool for timely and accurate prediction of Service-Level-Agreement (SLA) violations caused by performance problems in database systems. Fa periodically collects performance data at three levels: applications, database server, and operating system. This data is used to construct probabilistic models for predicting SLA violations. Fa currently uses graphical Bayesian network models because of their ability to support a wide range of inferences, including prediction and diagnosis, as well as their support for interactive visualization and presentation of complex system behavior in intuitive ways.","PeriodicalId":416090,"journal":{"name":"Proceedings of the 2006 ACM SIGMOD international conference on Management of data","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121360873","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Approximately detecting duplicates for streaming data using stable bloom filters","authors":"Fan Deng, Davood Rafiei","doi":"10.1145/1142473.1142477","DOIUrl":"https://doi.org/10.1145/1142473.1142477","url":null,"abstract":"Traditional duplicate elimination techniques are not applicable to many data stream applications. In general, precisely eliminating duplicates in an unbounded data stream is not feasible in many streaming scenarios. Therefore, we target at approximately eliminating duplicates in streaming environments given a limited space. Based on a well-known bitmap sketch, we introduce a data structure, Stable Bloom Filter, and a novel and simple algorithm. The basic idea is as follows: since there is no way to store the whole history of the stream, SBF continuously evicts the stale information so that SBF has room for those more recent elements. After finding some properties of SBF analytically, we show that a tight upper bound of false positive rates is guaranteed. In our empirical study, we compare SBF to alternative methods. The results show that our method is superior in terms of both accuracy and time effciency when a fixed small space and an acceptable false positive rate are given.","PeriodicalId":416090,"journal":{"name":"Proceedings of the 2006 ACM SIGMOD international conference on Management of data","volume":"102 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115477326","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A. Doan, R. Ramakrishnan, Shivakumar Vaithyanathan
{"title":"Managing information extraction: state of the art and research directions","authors":"A. Doan, R. Ramakrishnan, Shivakumar Vaithyanathan","doi":"10.1145/1142473.1142595","DOIUrl":"https://doi.org/10.1145/1142473.1142595","url":null,"abstract":"This tutorial makes the case for developing a unified framework that manages information extraction from unstructured data (focusing in particular on text). We first survey research on information extraction in the database, AI, NLP, IR, and Web communities in recent years. Then we discuss why this is the right time for the database community to actively participate and address the problem of managing information extraction (including in particular the challenges of maintaining and querying the extracted information, and accounting for the imprecision and uncertainty inherent in the extraction process). Finally, we show how interested researchers can take the next step, by pointing to open problems, available datasets, applicable standards, and software tools. We do not assume prior knowledge of text management, NLP, extraction techniques, or machine learning.","PeriodicalId":416090,"journal":{"name":"Proceedings of the 2006 ACM SIGMOD international conference on Management of data","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127536994","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Supporting ad-hoc ranking aggregates","authors":"Chengkai Li, K. Chang, I. Ilyas","doi":"10.1145/1142473.1142481","DOIUrl":"https://doi.org/10.1145/1142473.1142481","url":null,"abstract":"This paper presents a principled framework for efficient processing of ad-hoc top-k (ranking) aggregate queries, which provide the k groups with the highest aggregates as results. Essential support of such queries is lacking in current systems, which process the queries in a naïve materialize-group-sort scheme that can be prohibitively inefficient. Our framework is based on three fundamental principles. The Upper-Bound Principle dictates the requirements of early pruning, and the Group-Ranking and Tuple-Ranking Principles dictate group-ordering and tuple-ordering requirements. They together guide the query processor toward a provably optimal tuple schedule for aggregate query processing. We propose a new execution framework to apply the principles and requirements. We address the challenges in realizing the framework and implementing new query operators, enabling efficient group-aware and rank-aware query plans. The experimental study validates our framework by demonstrating orders of magnitude performance improvement in the new query plans, compared with the traditional plans.","PeriodicalId":416090,"journal":{"name":"Proceedings of the 2006 ACM SIGMOD international conference on Management of data","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127589785","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Graph-based synopses for relational selectivity estimation","authors":"Joshua Spiegel, N. Polyzotis","doi":"10.1145/1142473.1142497","DOIUrl":"https://doi.org/10.1145/1142473.1142497","url":null,"abstract":"This paper introduces the Tuple Graph (TUG) synopses, a new class of data summaries that enable accurate selectivity estimates for complex relational queries. The proposed summarization framework adopts a \"semi-structured\" view of the relational database, modeling a relational data set as a graph of tuples and join queries as graph traversals respectively. The key idea is to approximate the structure of the induced data graph in a concise synopsis, and to estimate the selectivity of a query by performing the corresponding traversal over the summarized graph. We detail the TUG synopsis model that is based on this novel approach, and we describe an efficient and scalable construction algorithm for building accurate TUGs within a specific storage budget. We validate the performance of TUGs with an extensive experimental study on real-life and synthetic data sets. Our results verify the effectiveness of TUGs in generating accurate selectivity estimates for complex join queries, and demonstrate their benefits over existing summarization techniques.","PeriodicalId":416090,"journal":{"name":"Proceedings of the 2006 ACM SIGMOD international conference on Management of data","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127016803","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Database support for matching: limitations and opportunities","authors":"A. Kini, S. Shankar, J. Naughton, D. DeWitt","doi":"10.1145/1142473.1142484","DOIUrl":"https://doi.org/10.1145/1142473.1142484","url":null,"abstract":"We define a match join of R and S with predicate θ to be a subset of the θ-join of R and S such that each tuple of R and S contributes to at most one result tuple. Match joins and their generalizations belong to a broad class of matching problems that have attracted a great deal of attention in disciplines including operations research and theoretical computer science. Instances of these problems arise in practice in resource allocation scenarios. To the best of our knowledge no one uses an RDBMS as a tool to help solve these problems; our goal in this paper is to explore whether or not this needs to be the case. We show that the simple approach of computing the full θ-join and then applying standard graph-matching algorithms to the result is ineffective for all but the smallest of problem instances. By contrast, a closer study shows that the DBMS primitives of grouping, sorting, and joining can be exploited to yield efficient match join operations. This suggests that RDBMSs can play a role in matching related problems beyond merely serving as expensive file systems exporting data sets to external user programs.","PeriodicalId":416090,"journal":{"name":"Proceedings of the 2006 ACM SIGMOD international conference on Management of data","volume":"201 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122569579","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Panagiotis G. Ipeirotis, Eugene Agichtein, Pranay Jain, L. Gravano
{"title":"To search or to crawl?: towards a query optimizer for text-centric tasks","authors":"Panagiotis G. Ipeirotis, Eugene Agichtein, Pranay Jain, L. Gravano","doi":"10.1145/1142473.1142504","DOIUrl":"https://doi.org/10.1145/1142473.1142504","url":null,"abstract":"Text is ubiquitous and, not surprisingly, many important applications rely on textual data for a variety of tasks. As a notable example, information extraction applications derive structured relations from unstructured text; as another example, focused crawlers explore the web to locate pages about specific topics. Execution plans for text-centric tasks follow two general paradigms for processing a text database: either we can scan, or 'crawl,\" the text database or, alternatively, we can exploit search engine indexes and retrieve the documents of interest via carefully crafted queries constructed in task-specific ways. The choice between crawl- and query-based execution plans can have a substantial impact on both execution time and output \"completeness\" (e.g., in terms of recall). Nevertheless, this choice is typically ad-hoc and based on heuristics or plain intuition. In this paper, we present fundamental building blocks to make the choice of execution plans for text-centric tasks in an informed, cost-based way. Towards this goal, we show how to analyze query- and crawl-based plans in terms of both execution time and output completeness. We adapt results from random-graph theory and statistics to develop a rigorous cost model for the execution plans. Our cost model reflects the fact that the performance of the plans depends on fundamental task-specific properties of the underlying text databases. We identify these properties and present efficient techniques for estimating the associated cost-model parameters. Overall, our approach helps predict the most appropriate execution plans for a task, resulting in significant efficiency and output completeness benefits. We complement our results with a large-scale experimental evaluation for three important text-centric tasks and over multiple real-life data sets.","PeriodicalId":416090,"journal":{"name":"Proceedings of the 2006 ACM SIGMOD international conference on Management of data","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134227609","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Heasoo Hwang, Vagelis Hristidis, Y. Papakonstantinou
{"title":"ObjectRank: a system for authority-based search on databases","authors":"Heasoo Hwang, Vagelis Hristidis, Y. Papakonstantinou","doi":"10.1145/1142473.1142593","DOIUrl":"https://doi.org/10.1145/1142473.1142593","url":null,"abstract":"We present ObjectRank demo system that performs authority-based keyword search on bibliographic databases. We also provide Inverse ObjectRank as a keyword-specific specificity metric and other calibration parameters such as Global ObjectRank. Users can specify various combinations of calibration values to control the behavior of the demo. Finally, we propose a methodology that enables us to extend query results using the ontology graph.","PeriodicalId":416090,"journal":{"name":"Proceedings of the 2006 ACM SIGMOD international conference on Management of data","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133997972","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"User-defined aggregate functions: bridging theory and practice","authors":"Sara Cohen","doi":"10.1145/1142473.1142480","DOIUrl":"https://doi.org/10.1145/1142473.1142480","url":null,"abstract":"The ability to create user-defined aggregate functions (UDAs) is rapidly becoming a standard feature in relational database systems. Therefore, problems such as query optimization, query rewriting and view maintenance must take into account queries (or views) with UDAs. There is a wealth of research on these problems for queries with general aggregate functions. Unfortunately, there is a mismatch between the manner in which UDAs are created, and the information that the database system requires in order to apply previous research.The purpose of this paper is to explore this mismatch and to bridge the gap between theory and practice, thereby enabling UDAs to become first-class citizens within the database. Specifically, we consider query optimization, query rewriting and view maintenance for queries with UDAs. For each of these problems we first survey previous results and explore the mismatch between theory and practice. We then present theoretical and practical insights that can be combined to derive a coherent framework for defining UDAs within a database system.","PeriodicalId":416090,"journal":{"name":"Proceedings of the 2006 ACM SIGMOD international conference on Management of data","volume":"729 ","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134127122","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"LINQ: reconciling object, relations and XML in the .NET framework","authors":"E. Meijer, B. Beckman, G. Bierman","doi":"10.1145/1142473.1142552","DOIUrl":"https://doi.org/10.1145/1142473.1142552","url":null,"abstract":"Many software applications today need to handle data from different data models; typically objects from the host programming language along with the relational and XML data models. The ROX impedance mismatch makes programs awkward to write and hard to maintain.The .NET Language-Integrated Query (LINQ) framework, proposed for the next release of the .NET framework, approaches this problem by defining a pattern of general-purpose standard query operators for traversal, filter, and projection. Based on this pattern, any .NET language can define special query comprehension syntax that is subsequently compiled into these standard operators (our code examples are in VB).Besides the general query operators, the LINQ framework also defines two domain specific APIs that work over XML (XLinq) and relational data (DLinq) respectively. The operators over XML use a lightweight and easy to use in-memory XML representation to provide XQuery-style expressiveness in the host programming language. The operators over relational data provide a simple OR mapping by leveraging remotable queries that are executed directly in the back-end relational store.","PeriodicalId":416090,"journal":{"name":"Proceedings of the 2006 ACM SIGMOD international conference on Management of data","volume":"321 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131357138","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}