{"title":"Session details: Research session 19: semi-structured data management","authors":"J. Shanmugasundaram","doi":"10.1145/3257467","DOIUrl":"https://doi.org/10.1145/3257467","url":null,"abstract":"","PeriodicalId":344093,"journal":{"name":"Proceedings of the 2009 ACM SIGMOD International Conference on Management of data","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129888011","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enabling enterprise mashups over unstructured text feeds with InfoSphere MashupHub and SystemT","authors":"David E. Simmen, Frederick Reiss, Yunyao Li, Suresh Thalamati","doi":"10.1145/1559845.1559999","DOIUrl":"https://doi.org/10.1145/1559845.1559999","url":null,"abstract":"Enterprise mashup scenarios often involve feeds derived from data created primarily for eye consumption, such as email, news, calendars, blogs, and web feeds. These data sources can test the capabilities of current data mashup products, as the attributes needed to perform join, aggregation, and other operations are often buried within unstructured feed text. Information extraction technology is a key enabler in such scenarios, using annotators to convert unstructured text into structured information that can facilitate mashup operations. Our demo presents the integration of SystemT, an information extraction system from IBM Research, with IBM's InfoSphere MashupHub. We show how to build domain-specific annotators with SystemT's declarative rule language, AQL, and how to use these annotators to combine structured and unstructured information in an enterprise mashup.","PeriodicalId":344093,"journal":{"name":"Proceedings of the 2009 ACM SIGMOD International Conference on Management of data","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125301835","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PRIMA: archiving and querying historical data with evolving schemas","authors":"H. J. Moon, C. Curino, MyungWon Ham, C. Zaniolo","doi":"10.1145/1559845.1559970","DOIUrl":"https://doi.org/10.1145/1559845.1559970","url":null,"abstract":"Schema evolution poses serious challenges in historical data management. Traditionally, historical data have been archived either by (i) migrating them into the current schema version, which is well understood by users but compromises archival quality, or (ii) maintaining them under the schema version in which they were originally created, which preserves perfect archival quality but forces users to formulate queries against complex histories of evolving schemas. The PRIMA system achieves the best of both approaches by (i) archiving historical data under the schema version under which they were originally created, and (ii) letting users express temporal queries using the current schema version. Thus, in PRIMA, the system rewrites the queries to the (potentially many) pertinent versions of the evolving schema. Moreover, the system offers automatic documentation of the schema history, and allows users to pose temporal queries over the metadata history itself. The proposed demonstration highlights the system features using both a synthetic educational running example and real-life evolution histories (schemas and data), which include hundreds of schema versions from Wikipedia and Ensembl. The demonstration offers a thorough walk-through of the system features and a hands-on system testing phase, where the audience is invited to interact directly with the advanced query interface of PRIMA.","PeriodicalId":344093,"journal":{"name":"Proceedings of the 2009 ACM SIGMOD International Conference on Management of data","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127588284","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Simplifying XML schema: effortless handling of nondeterministic regular expressions","authors":"G. Bex, W. Gelade, W. Martens, F. Neven","doi":"10.1145/1559845.1559922","DOIUrl":"https://doi.org/10.1145/1559845.1559922","url":null,"abstract":"Whether beloved or despised, XML Schema is currently the only industrially accepted schema language for XML and is unlikely to become obsolete any time soon. Nevertheless, many nontransparent restrictions unnecessarily complicate the design of XSDs. For instance, complex content models in XML Schema are constrained by the infamous unique particle attribution (UPA) constraint. In formal language theoretic terms, this constraint restricts content models to deterministic regular expressions. As the latter constitute a semantic notion with no known simple syntactic characterization, it is very difficult for non-expert users to understand exactly when and why content models do or do not violate UPA. In this paper, we therefore investigate solutions that relieve users of the burden of UPA by automatically transforming nondeterministic expressions into concise deterministic ones defining the same language or constituting good approximations. The presented techniques facilitate XSD construction by reducing the design task closer to the complexity of the modeling task. In addition, our algorithms can serve as a plug-in for any model management tool that supports export to the XML Schema format.","PeriodicalId":344093,"journal":{"name":"Proceedings of the 2009 ACM SIGMOD International Conference on Management of data","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127642162","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Keyword search on structured and semi-structured data","authors":"Yi Chen, Wei Wang, Ziyang Liu, Xuemin Lin","doi":"10.1145/1559845.1559966","DOIUrl":"https://doi.org/10.1145/1559845.1559966","url":null,"abstract":"Empowering users to access databases using simple keywords can relieve the users from the steep learning curve of mastering a structured query language and understanding complex and possibly fast evolving data schemas. In this tutorial, we give an overview of the state-of-the-art techniques for supporting keyword search on structured and semi-structured data, including query result definition, ranking functions, result generation and top-k query processing, snippet generation, result clustering, query cleaning, performance optimization, and search quality evaluation. Various data models will be discussed, including relational data, XML data, graph-structured data, data streams, and workflows. We also discuss applications that are built upon keyword search, such as keyword based database selection, query generation, and analytical processing. Finally we identify the challenges and opportunities of future research to advance the field.","PeriodicalId":344093,"journal":{"name":"Proceedings of the 2009 ACM SIGMOD International Conference on Management of data","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130019947","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Search your memory! - an associative memory based desktop search system","authors":"Jidong Chen, Hang Guo, Wentao Wu, Chunxin Xie","doi":"10.1145/1559845.1559992","DOIUrl":"https://doi.org/10.1145/1559845.1559992","url":null,"abstract":"We present XSearcher, an associative memory based desktop search system, which exploits associations by creating semantic links among personal desktop resources from explicit and implicit user activities. With these links, associations among memory fragments can be built or rebuilt in a user's brain during a search. The personalized ranking scheme uses these links together with a user's personal preferences to rank results by both relevance and importance. XSearcher enhances traditional keyword based search systems since it is closer to the way that human associative memory works.","PeriodicalId":344093,"journal":{"name":"Proceedings of the 2009 ACM SIGMOD International Conference on Management of data","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130180239","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Vispedia: on-demand data integration for interactive visualization and exploration","authors":"Bryan Chan, Justin Talbot, Leslie Wu, Nathan Sakunkoo, Mike Cammarano, P. Hanrahan","doi":"10.1145/1559845.1560003","DOIUrl":"https://doi.org/10.1145/1559845.1560003","url":null,"abstract":"Wikipedia is an example of the large, collaborative, semi-structured data sets emerging on the Web. Typically, before these data sets can be used, they must be transformed into structured tables via data integration. We present Vispedia, a Web-based visualization system which incorporates data integration into an iterative, interactive data exploration and analysis process. This reduces the upfront cost of using heterogeneous data sets like Wikipedia. Vispedia is driven by a keyword-query-based integration interface implemented using a fast graph search. The search occurs interactively over DBpedia's semantic graph of Wikipedia, without depending on the existence of a structured ontology. This combination of data integration and visualization enables a broad class of non-expert users to more effectively use the semi-structured data available on the Web.","PeriodicalId":344093,"journal":{"name":"Proceedings of the 2009 ACM SIGMOD International Conference on Management of data","volume":"71 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127895086","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient type-ahead search on relational data: a TASTIER approach","authors":"Guoliang Li, S. Ji, Chen Li, Jianhua Feng","doi":"10.1145/1559845.1559918","DOIUrl":"https://doi.org/10.1145/1559845.1559918","url":null,"abstract":"Existing keyword-search systems in relational databases require users to submit a complete query to compute answers. Often users feel \"left in the dark\" when they have limited knowledge about the data, and have to use a try-and-see approach for modifying queries and finding answers. In this paper we propose a novel approach to keyword search in the relational world, called Tastier. A Tastier system can bring instant gratification to users by supporting type-ahead search, which finds answers \"on the fly\" as the user types in query keywords. A main challenge is how to achieve a high interactive speed for large amounts of data in multiple tables, so that a query can be answered efficiently within milliseconds. We propose efficient index structures and algorithms for finding relevant answers on-the-fly by joining tuples in the database. We devise a partition-based method to improve query performance by grouping highly relevant tuples and pruning irrelevant tuples efficiently. We also develop a technique to answer a query efficiently by predicting the highly relevant complete queries for the user. We have conducted a thorough experimental evaluation of the proposed techniques on real data sets to demonstrate the efficiency and practicality of this new search paradigm.","PeriodicalId":344093,"journal":{"name":"Proceedings of the 2009 ACM SIGMOD International Conference on Management of data","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121182796","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ROX: run-time optimization of XQueries","authors":"Riham Abdel Kader, P. Boncz, S. Manegold, M. van Keulen","doi":"10.1145/1559845.1559910","DOIUrl":"https://doi.org/10.1145/1559845.1559910","url":null,"abstract":"Optimization of complex XQueries combining many XPath steps and joins is currently hindered by the absence of good cardinality estimation and cost models for XQuery. Additionally, the state of the art of even relational query optimization still struggles to cope with cost model estimation errors that increase with plan size, as well as with the effect of correlated joins and selections. In this research, we propose to radically depart from the traditional path of separating the query compilation and query execution phases, by having the optimizer execute, materialize partial results, and use sampling-based estimation techniques to observe the characteristics of intermediates. The proposed technique takes as input a join graph where the edges are either equi-joins or XPath steps, and the execution environment provides value- and structural-join algorithms, as well as structural and value-based indices. While run-time optimization with sampling removes many of the vulnerabilities of classical optimizers, it brings its own challenges in keeping resource usage under control, both for the materialization of intermediates and for the cost of plan exploration using sampling. Our approach deals with these issues by limiting the run-time search space to so-called \"zero-investment\" algorithms, for which sampling can be guaranteed to be strictly linear in sample size. All operators and XML value indices used by ROX for sampling have the zero-investment property. We perform extensive experimental evaluation on large XML datasets, showing that our run-time query optimizer finds good query plans in a robust fashion and has limited run-time overhead.","PeriodicalId":344093,"journal":{"name":"Proceedings of the 2009 ACM SIGMOD International Conference on Management of data","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116393487","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Taming the storage dragon: the adventures of hoTMaN","authors":"Shahram Ghandeharizadeh, A. Goodney, Chetan Sharma, Chris Bissell, Felipe Cariño, Naveen Nannapaneni, Alex Wergeles, Aber Whitcomb","doi":"10.1145/1559845.1559949","DOIUrl":"https://doi.org/10.1145/1559845.1559949","url":null,"abstract":"HoTMaN (HoT-standby MaNager) is a joint project between MySpace and the USC Database Laboratory to design and develop a tool that ensures 24x7 uptime and eases the administration of terabytes of storage sitting underneath hundreds of database servers. The HoTMaN tool's innovation and uniqueness is that it can, with a few clicks, perform operational tasks that otherwise require hundreds of keyboard strokes by \"trusted trained\" experts. With HoTMaN, MySpace can within minutes migrate the relational database(s) of a failed server to a hot standby. A process that could take over an hour and had a high potential for human error is now performed reliably. A database internal to HoTMaN captures all virtual disk, volume, and file configurations associated with each SQL Server and the candidate hot-standby servers to which SQL Server processing could be migrated. HoTMaN is deployed in production, and its current operational benefits include: (i) enhanced availability of data, and (ii) planned maintenance and patching.","PeriodicalId":344093,"journal":{"name":"Proceedings of the 2009 ACM SIGMOD International Conference on Management of data","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122565703","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}