{"title":"Dynamic plan generation for parameterized queries","authors":"A. Ghazal, Dawit Yimam Seid, Ramesh Bhashyam, A. Crolotte, Manjula Koppuravuri, G. Vinod","doi":"10.1145/1559845.1559946","DOIUrl":"https://doi.org/10.1145/1559845.1559946","url":null,"abstract":"Query processing in a DBMS typically involves two distinct phases: compilation, which generates the best plan and its corresponding execution steps, and execution, which evaluates these steps against database objects. For some queries, considerable resource savings can be achieved by skipping the compilation phase when the same query was previously submitted and its plan was already cached. In a number of important applications the same query, called a Parameterized Query (PQ), is repeatedly submitted in the same basic form but with different parameter values. PQ's are extensively used in both data update (e.g. batch update programs) and data access queries. There are tradeoffs associated with caching and re-using query plans, such as space utilization and maintenance cost. Moreover, pre-compiled plans may be suboptimal for a particular execution for various reasons, including data skew and the inability to exploit value-based query transformations like materialized view rewrite and unsatisfiable predicate elimination. We address these tradeoffs by distinguishing two types of plans for PQ's: generic and specific plans. Generic plans are pre-compiled plans that are independent of the actual parameter values. Prior to execution, parameter values are plugged in to generic plans. In specific plans, parameter values are plugged in prior to the compilation phase. This paper provides a practical framework for dynamically deciding between specific and generic plans for PQ's based on a mix of rule- and cost-based heuristics, which are implemented in the Teradata 12.0 DBMS.","PeriodicalId":344093,"journal":{"name":"Proceedings of the 2009 ACM SIGMOD International Conference on Management of data","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117323402","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Top-k generation of integrated schemas based on directed and weighted correspondences","authors":"A. Radwan, Lucian Popa, I. Stanoi, A. Younis","doi":"10.1145/1559845.1559913","DOIUrl":"https://doi.org/10.1145/1559845.1559913","url":null,"abstract":"Schema integration is the problem of creating a unified target schema based on a set of existing source schemas and on a set of correspondences that are the result of matching the source schemas. Previous methods for schema integration rely on the exploration, implicit or explicit, of the multiple design choices that are possible for the integrated schema. Such exploration relies heavily on user interaction; thus, it is time consuming and labor intensive. Furthermore, previous methods have ignored the additional information that typically results from the schema matching process, that is, the weights and in some cases the directions that are associated with the correspondences. In this paper, we propose a more automatic approach to schema integration that is based on the use of directed and weighted correspondences between the concepts that appear in the source schemas. A key component of our approach is a novel top-k ranking algorithm for the automatic generation of the best candidate schemas. The algorithm gives more weight to schemas that combine the concepts with higher similarity or coverage. Thus, the algorithm makes certain decisions that otherwise would likely be taken by a human expert. We show that the algorithm runs in polynomial time and moreover has good performance in practice.","PeriodicalId":344093,"journal":{"name":"Proceedings of the 2009 ACM SIGMOD International Conference on Management of data","volume":"107 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115560415","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Incremental maintenance of length normalized indexes for approximate string matching","authors":"Marios Hadjieleftheriou, Nick Koudas, D. Srivastava","doi":"10.1145/1559845.1559891","DOIUrl":"https://doi.org/10.1145/1559845.1559891","url":null,"abstract":"Approximate string matching is a problem that has received a lot of attention recently. Existing work on information retrieval has concentrated on a variety of similarity measures (TF/IDF, BM25, HMM, etc.) specifically tailored for document retrieval purposes. As new applications that depend on retrieving short strings are becoming popular (e.g., local search engines like YellowPages.com, Yahoo!Local, and Google Maps), new indexing methods are needed, tailored for short strings. For that purpose, a number of indexing techniques and related algorithms have been proposed based on length normalized similarity measures. A common denominator of indexes for length normalized measures is that maintaining the underlying structures in the presence of incremental updates is inefficient, mainly due to data dependent, precomputed weights associated with each distinct token and string. Incorporating updates is usually accomplished by rebuilding the indexes at regular time intervals. In this paper we present a framework that advocates lazy update propagation with the following key feature: efficient, incremental updates that immediately reflect the new data in the indexes in a way that gives strict guarantees on the quality of subsequent query answers. More specifically, our techniques guarantee against false negatives and limit the number of false positives produced. We implement a fully working prototype and illustrate that the proposed ideas work well in practice on real datasets.","PeriodicalId":344093,"journal":{"name":"Proceedings of the 2009 ACM SIGMOD International Conference on Management of data","volume":"2010 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114207365","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A framework for testing query transformation rules","authors":"Hicham G. Elmongui, Vivek R. Narasayya, Ravishankar Ramamurthy","doi":"10.1145/1559845.1559874","DOIUrl":"https://doi.org/10.1145/1559845.1559874","url":null,"abstract":"In order to enable extensibility, modern query optimizers typically leverage a transformation rule based framework. Testing individual rule correctness as well as correctness of rule interactions is crucial in verifying the functionality of a query optimizer. While there has been a lot of work on how to architect optimizers for extensibility using a rule based framework, there has been relatively little work on how to test such optimizers. In this paper we present a framework for testing query transformation rules which enables: (a) efficient generation of queries that exercise a particular transformation rule or a set of rules and (b) efficient execution of corresponding test suites for correctness testing.","PeriodicalId":344093,"journal":{"name":"Proceedings of the 2009 ACM SIGMOD International Conference on Management of data","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125312654","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimizing complex extraction programs over evolving text data","authors":"Fei Chen, Byron J. Gao, A. Doan, Jun Yang, R. Ramakrishnan","doi":"10.1145/1559845.1559881","DOIUrl":"https://doi.org/10.1145/1559845.1559881","url":null,"abstract":"Most information extraction (IE) approaches have considered only static text corpora, over which we apply IE only once. Many real-world text corpora, however, are dynamic. They evolve over time, and so to keep extracted information up to date we often must apply IE repeatedly, to consecutive corpus snapshots. Applying IE from scratch to each snapshot can take a lot of time. To avoid doing this, we have recently developed Cyclex, a system that recycles previous IE results to speed up IE over subsequent corpus snapshots. Cyclex clearly demonstrated the promise of the recycling idea. The work itself, however, is limited in that it considers only IE programs that contain a single IE ``blackbox.'' In practice, many IE programs are far more complex, containing multiple IE blackboxes connected in a compositional ``workflow.'' In this paper, we present Delex, a system that removes the above limitation. First we identify many difficult challenges raised by Delex, including modeling complex IE programs for recycling purposes, implementing the recycling process efficiently, and searching for an optimal execution plan in a vast plan space with different recycling alternatives. Next we describe our solutions to these challenges. Finally, we describe extensive experiments with both rule-based and learning-based IE programs over two real-world data sets, which demonstrate the utility of our approach.","PeriodicalId":344093,"journal":{"name":"Proceedings of the 2009 ACM SIGMOD International Conference on Management of data","volume":"62 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123358867","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MobileMiner: a real world case study of data mining in mobile communication","authors":"Tengjiao Wang, Bishan Yang, Jun Gao, Dongqing Yang, Shiwei Tang, Haoyu Wu, Kedong Liu, J. Pei","doi":"10.1145/1559845.1559988","DOIUrl":"https://doi.org/10.1145/1559845.1559988","url":null,"abstract":"Mobile communication data analysis has been often used as a background application to motivate many data mining problems. However, very few data mining researchers have a chance to see a working data mining system on real mobile communication data. In this demo, we showcase our new system MobileMiner on a real mobile communication data set, which presents a case study of business solutions using state-of-the-art data mining techniques. MobileMiner adaptively profiles users' behavior from their calling and moving record streams. Customer segmentation and social community analysis can be conducted based on user profiles. We show how data mining techniques can help in mobile communication data analysis. Moreover, we also show some interesting observations which still cannot be mined by the current techniques, and thus may motivate new research and development.","PeriodicalId":344093,"journal":{"name":"Proceedings of the 2009 ACM SIGMOD International Conference on Management of data","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126680412","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Special invited session on systems research and information management","authors":"M. Carey","doi":"10.1145/3257477","DOIUrl":"https://doi.org/10.1145/3257477","url":null,"abstract":"","PeriodicalId":344093,"journal":{"name":"Proceedings of the 2009 ACM SIGMOD International Conference on Management of data","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124862079","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Top-k queries on uncertain data: on score distribution and typical answers","authors":"Tingjian Ge, S. Zdonik, S. Madden","doi":"10.1145/1559845.1559886","DOIUrl":"https://doi.org/10.1145/1559845.1559886","url":null,"abstract":"Uncertain data arises in a number of domains, including data integration and sensor networks. Top-k queries that rank results according to some user-defined score are an important tool for exploring large uncertain data sets. As several recent papers have observed, the semantics of top-k queries on uncertain data can be ambiguous due to tradeoffs between reporting high-scoring tuples and tuples with a high probability of being in the resulting data set. In this paper, we demonstrate the need to present the score distribution of top-k vectors to allow the user to choose between results along this score-probability dimension. One option would be to display the complete distribution of all potential top-k tuple vectors, but this set is too large to compute. Instead, we propose to provide a number of typical vectors that effectively sample this distribution. We propose efficient algorithms to compute these vectors. We also extend the semantics and algorithms to the scenario of score ties, which is not dealt with in previous work in the area. Our work includes a systematic empirical study on both real and synthetic datasets.","PeriodicalId":344093,"journal":{"name":"Proceedings of the 2009 ACM SIGMOD International Conference on Management of data","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122064920","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Research session 9: data on the web","authors":"L. Gravano","doi":"10.1145/3257457","DOIUrl":"https://doi.org/10.1145/3257457","url":null,"abstract":"","PeriodicalId":344093,"journal":{"name":"Proceedings of the 2009 ACM SIGMOD International Conference on Management of data","volume":"96 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125703117","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Data warehouse technology by infobright","authors":"D. Ślęzak, Victoria Eastwood","doi":"10.1145/1559845.1559933","DOIUrl":"https://doi.org/10.1145/1559845.1559933","url":null,"abstract":"We discuss Infobright technology with respect to its main features and architectural differentiators. We introduce the upcoming research and development projects that may be of special interest to the academic and industry communities.","PeriodicalId":344093,"journal":{"name":"Proceedings of the 2009 ACM SIGMOD International Conference on Management of data","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133947262","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}