{"title":"LotusX: A Position-Aware XML Graphical Search System with Auto-Completion","authors":"Chunbin Lin, Jiaheng Lu, T. Ling, Bogdan Cautis","doi":"10.1109/ICDE.2012.123","DOIUrl":"https://doi.org/10.1109/ICDE.2012.123","url":null,"abstract":"Existing query languages for XML (e.g., XQuery) require professional programming skills to formulate, and such complex query languages also burden query processing. In addition, when issuing an XML query, users need to be familiar with the content (including the structural and textual information) of the hierarchical XML document, which is difficult for common users. Designing user-friendly interfaces that reduce the burden of query formulation is therefore fundamental to the broader adoption of XML. We present a twig-based XML graphical search system, called LotusX, that provides a graphical interface to simplify query formulation, with no need to learn a query language or data schemas, or to know the content of the XML document. The basic idea is that LotusX offers \"position-aware\" and \"auto-completion\" features that help users create tree-modeled queries (twig patterns) by providing possible candidates on-the-fly. In addition, complex twig queries (including order-sensitive queries) are supported in LotusX. Furthermore, a new ranking strategy and a query rewriting solution are implemented to rank and rewrite queries effectively. 
We provide an online demo of the LotusX system: http://datasearch.ruc.edu.cn:8080/LotusX.","PeriodicalId":321608,"journal":{"name":"2012 IEEE 28th International Conference on Data Engineering","volume":"66 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124630788","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Correlation Support for Risk Evaluation in Databases","authors":"K. Eisenreich, J. Adamek, Philipp J. Rösch, V. Markl, Gregor Hackenbroich","doi":"10.1109/ICDE.2012.30","DOIUrl":"https://doi.org/10.1109/ICDE.2012.30","url":null,"abstract":"Investigating potential dependencies in data and their effect on future business developments can help experts to prevent misestimations of risks and chances. This makes correlation a highly important factor in risk analysis tasks. Previous research on correlation in uncertain data management has focused primarily on handling dependencies between discrete rather than continuous distributions. Also, none of the existing approaches provides a clear method for extracting correlation structures from data and introducing assumptions about correlation into independently represented data. To enable risk analysis under correlation assumptions, we use an approximation technique based on copula functions. This technique enables analysts to introduce arbitrary correlation structures between arbitrary distributions and to calculate relevant measures over the correlated data. The correlation information can either be extracted at runtime from historic data or be accessed from a parametrically precomputed structure. We discuss the construction, application, and querying of approximate correlation representations for different analysis tasks. 
Our experiments demonstrate the efficiency and accuracy of the proposed approach, and point out several possibilities for optimization.","PeriodicalId":321608,"journal":{"name":"2012 IEEE 28th International Conference on Data Engineering","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129233692","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multidimensional Analysis of Atypical Events in Cyber-Physical Data","authors":"L. Tang, Xiao Yu, Sangkyum Kim, Jiawei Han, Wen-Chih Peng, Yizhou Sun, Hector Gonzalez, Sebastian Seith","doi":"10.1109/ICDE.2012.32","DOIUrl":"https://doi.org/10.1109/ICDE.2012.32","url":null,"abstract":"A Cyber-Physical System (CPS) integrates physical devices (e.g., sensors, cameras) with cyber (or informational) components to form a situation-integrated analytical system that may respond intelligently to dynamic changes in real-world situations. CPS has many promising applications, such as traffic observation, battlefield surveillance, and sensor-network-based monitoring. One important research topic in CPS is atypical event analysis, i.e., retrieving events from large amounts of data and analyzing them with spatial, temporal, and other multi-dimensional information. Many traditional approaches are not feasible for such analysis because they use numeric measures and cannot describe complex atypical events. In this study, we propose a new model, the atypical cluster, to effectively represent such events and efficiently retrieve them from massive data. The micro-cluster is designed to summarize individual events, and the macro-cluster is used to integrate information from multiple events. To facilitate scalable, flexible, and online analysis, the concept of a significant cluster is defined and a guided clustering algorithm is proposed to retrieve significant clusters efficiently. 
We conduct experiments on real datasets of more than 50 GB; the results show that the proposed method can provide more accurate information at only 15% to 20% of the time cost of the baselines.","PeriodicalId":321608,"journal":{"name":"2012 IEEE 28th International Conference on Data Engineering","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123920398","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Lookup Tables: Fine-Grained Partitioning for Distributed Databases","authors":"Aubrey Tatarowicz, C. Curino, E. Jones, S. Madden","doi":"10.1109/ICDE.2012.26","DOIUrl":"https://doi.org/10.1109/ICDE.2012.26","url":null,"abstract":"The standard way to get linear scaling in a distributed OLTP DBMS is to horizontally partition data across several nodes. Ideally, this partitioning will result in each query being executed at just one node, to avoid the overheads of distributed transactions and to allow nodes to be added without increasing the amount of required coordination. For some applications, simple strategies, such as hashing on the primary key, provide this property. Unfortunately, for many applications, including social networking and order fulfillment, many-to-many relationships cause simple strategies to result in a large fraction of distributed queries. Instead, what is needed is a fine-grained partitioning, where related individual tuples (e.g., cliques of friends) are co-located in the same partition. Maintaining such a fine-grained partitioning requires the database to store a large amount of metadata about which partition each tuple resides in. We call such metadata a lookup table, and present the design of a data distribution layer that efficiently stores these tables and maintains them in the presence of inserts, deletes, and updates. We show that such tables can provide scalability for several difficult-to-partition database workloads, including Wikipedia, Twitter, and TPC-E. 
Our implementation provides 40% to 300% better performance on these workloads than either simple range or hash partitioning and shows greater potential for further scale-out.","PeriodicalId":321608,"journal":{"name":"2012 IEEE 28th International Conference on Data Engineering","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125675204","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Entity Search Strategies for Mashup Applications","authors":"Stefan Endrullis, Andreas Thor, E. Rahm","doi":"10.1109/ICDE.2012.84","DOIUrl":"https://doi.org/10.1109/ICDE.2012.84","url":null,"abstract":"Programmatic data integration approaches such as mashups have become a viable approach to dynamically integrate web data at runtime. Key data sources for mashups include entity search engines and hidden databases that need to be queried via source-specific search interfaces or web forms. Current mashups are typically restricted to simple query approaches such as using keyword search. Such approaches may need a high number of queries if many objects have to be found. Furthermore, the effectiveness of the queries may be limited, i.e., they may miss relevant results. We therefore propose more advanced search strategies that aim at finding a set of entities with high efficiency and high effectiveness. Our strategies use different kinds of queries that are determined by source-specific query generators. Furthermore, the queries are selected based on the characteristics of input entities. We introduce a flexible model for entity search strategies that includes a ranking of candidate queries determined by different query generators. We describe different query generators and outline their use within four entity search strategies. These strategies apply different query ranking and selection approaches to optimize efficiency and effectiveness. We evaluate our search strategies in detail for two domains: product search and publication search. 
The comparison with a standard keyword search shows that the proposed search strategies provide significant improvements in both domains.","PeriodicalId":321608,"journal":{"name":"2012 IEEE 28th International Conference on Data Engineering","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115797035","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Ranking Query Answers in Probabilistic Databases: Complexity and Efficient Algorithms","authors":"Dan Olteanu, Hongkai Wen","doi":"10.1109/ICDE.2012.61","DOIUrl":"https://doi.org/10.1109/ICDE.2012.61","url":null,"abstract":"In many applications of probabilistic databases, the probabilities are mere degrees of uncertainty in the data and are not otherwise meaningful to the user. Often, users care only about the ranking of answers in decreasing order of their probabilities or about a few most likely answers. In this paper, we investigate the problem of ranking query answers in probabilistic databases. We give a dichotomy for ranking in the case of conjunctive queries without repeating relation symbols: it is either in polynomial time or NP-hard. Surprisingly, our syntactic characterisation of tractable queries is not the same as for probability computation. The key observation is that there are queries for which probability computation is #P-hard, yet ranking can be computed in polynomial time. This is possible whenever probability computation for distinct answers has a common factor that is hard to compute but irrelevant for ranking. We complement this tractability analysis with an effective ranking technique for conjunctive queries. Given a query, we construct a share plan, which exposes subqueries whose probability computation can be shared or ignored across query answers. Our technique combines share plans with incremental approximate probability computation of subqueries. 
We implemented our technique in the SPROUT query engine and report on performance gains of orders of magnitude over Monte Carlo simulation using FPRAS and exact probability computation based on knowledge compilation.","PeriodicalId":321608,"journal":{"name":"2012 IEEE 28th International Conference on Data Engineering","volume":"184 6","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131991305","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scalable and Numerically Stable Descriptive Statistics in SystemML","authors":"Yuanyuan Tian, S. Tatikonda, B. Reinwald","doi":"10.1109/ICDE.2012.12","DOIUrl":"https://doi.org/10.1109/ICDE.2012.12","url":null,"abstract":"With the exponential growth in the amount of data that is being generated in recent years, there is a pressing need for applying machine learning algorithms to large data sets. SystemML is a framework that employs a declarative approach for large scale data analytics. In SystemML, machine learning algorithms are expressed as scripts in a high-level language, called DML, which is syntactically similar to R. DML scripts are compiled, optimized, and executed in the SystemML runtime that is built on top of MapReduce. As the basis of virtually every quantitative analysis, descriptive statistics provide powerful tools to explore data in SystemML. In this paper, we describe our experience in implementing descriptive statistics in SystemML. In particular, we elaborate on how to overcome the two major challenges: (1) achieving numerical stability while operating on large data sets in a distributed setting of MapReduce, and (2) designing scalable algorithms to compute order statistics in MapReduce. By empirically comparing to algorithms commonly used in existing tools and systems, we demonstrate the numerical accuracy achieved by SystemML. 
We also highlight the valuable lessons we have learned in this exercise.","PeriodicalId":321608,"journal":{"name":"2012 IEEE 28th International Conference on Data Engineering","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129977757","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Future of Scientific Data Bases","authors":"M. Stonebraker, A. Ailamaki, J. Kepner, A. Szalay","doi":"10.1109/ICDE.2012.151","DOIUrl":"https://doi.org/10.1109/ICDE.2012.151","url":null,"abstract":"For many decades, users in scientific fields (domain scientists) have resorted to either home-grown tools or legacy software for the management of their data. Technological advancements nowadays necessitate many of the properties found in the roadmap of DBMS technology, such as data independence, scalability, and functionality. DBMS products, however, are not yet ready to address scientific application and user needs. Recent efforts toward building a science DBMS indicate that there is a long way ahead of us, paved by a research agenda that is rich in interesting and challenging problems.","PeriodicalId":321608,"journal":{"name":"2012 IEEE 28th International Conference on Data Engineering","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133580651","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient Exact Similarity Searches Using Multiple Token Orderings","authors":"Jongik Kim, Hongrae Lee","doi":"10.1109/ICDE.2012.79","DOIUrl":"https://doi.org/10.1109/ICDE.2012.79","url":null,"abstract":"Similarity searches are essential in many applications including data cleaning and near duplicate detection. Many similarity search algorithms first generate candidate records, and then identify true matches among them. A major focus of those algorithms has been on how to reduce the number of candidate records in the early stage of similarity query processing. One of the most commonly used techniques to reduce the candidate size is the prefix filtering principle, which exploits the document frequency ordering of tokens. In this paper, we propose a novel partitioning technique that considers multiple token orderings based on token co-occurrence statistics. Experimental results show that the proposed technique is effective in reducing the number of candidate records and as a result improves the performance of existing algorithms significantly.","PeriodicalId":321608,"journal":{"name":"2012 IEEE 28th International Conference on Data Engineering","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124498692","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Recomputing Materialized Instances after Changes to Mappings and Data","authors":"Todd J. Green, Z. Ives","doi":"10.1109/ICDE.2012.107","DOIUrl":"https://doi.org/10.1109/ICDE.2012.107","url":null,"abstract":"A major challenge faced by today's information systems is that of evolution as data usage evolves or new data resources become available. Modern organizations sometimes exchange data with one another via declarative mappings among their databases, as in data exchange and collaborative data sharing systems. Such mappings are frequently revised and refined as new data becomes available, new cross-reference tables are created, and corrections are made. A fundamental question is how to handle changes to these mapping definitions, when the organizations each materialize the results of applying the mappings to the available data. We consider how to incrementally recompute these database instances in this setting, reusing (if possible) previously computed instances to speed up computation. We develop a principled solution that performs cost-based exploration of recomputation versus reuse, and simultaneously handles updates to source data and mapping definitions through a single, unified mechanism. Our solution also takes advantage of provenance information, when present, to speed up computation even further. 
We present an implementation that takes advantage of an off-the-shelf DBMS's query processing system, and we show experimentally that our approach provides substantial performance benefits.","PeriodicalId":321608,"journal":{"name":"2012 IEEE 28th International Conference on Data Engineering","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128635544","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}