{"title":"Reconciling while tolerating disagreement in collaborative data sharing","authors":"Nicholas E. Taylor, Z. Ives","doi":"10.1145/1142473.1142476","DOIUrl":"https://doi.org/10.1145/1142473.1142476","url":null,"abstract":"In many data sharing settings, such as within the biological and biomedical communities, global data consistency is not always attainable: different sites' data may be dirty, uncertain, or even controversial. Collaborators are willing to share their data, and in many cases they also want to selectively import data from others --- but must occasionally diverge when they disagree about uncertain or controversial facts or values. For this reason, traditional data sharing and data integration approaches are not applicable, since they require a globally consistent data instance. Additionally, many of these approaches do not allow participants to make updates; if they do, concurrency control algorithms or inconsistency repair techniques must be used to ensure a consistent view of the data for all users. In this paper, we develop and present a fully decentralized model of collaborative data sharing, in which participants publish their data on an ad hoc basis and simultaneously reconcile updates with those published by others. Individual updates are associated with provenance information, and each participant accepts only updates with a sufficient authority ranking, meaning that each participant may have a different (though conceptually overlapping) data instance. We define a consistency semantics for database instances under this model of disagreement, present algorithms that perform reconciliation for distributed clusters of participants, and demonstrate their ability to handle typical update and conflict loads in settings involving the sharing of curated data.","PeriodicalId":416090,"journal":{"name":"Proceedings of the 2006 ACM SIGMOD international conference on Management of data","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133693920","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Boolean + ranking: querying a database by k-constrained optimization","authors":"Zhen Zhang, Seung-won Hwang, K. Chang, Min Wang, Christian A. Lang, Yuan-Chi Chang","doi":"10.1145/1142473.1142515","DOIUrl":"https://doi.org/10.1145/1142473.1142515","url":null,"abstract":"The wide spread of databases for managing structured data, compounded with the expanded reach of the Internet, has brought forward interesting data retrieval and analysis scenarios to RDBMS. In such settings, queries often take the form of k-constrained optimization, with a Boolean constraint and a numeric optimization expression as the goal function, retrieving only the top-k tuples. This paper proposes the concept of supporting such queries, as their nature implies, by a functional optimization machinery over the search space of multiple indices. To realize this concept, we combine the dual perspectives of discrete state search (from the view of indices) and continuous function optimization (from the view of goal functions). We present, as the marriage of the two perspectives, the OPT* framework, which encodes k-constrained optimization as an A* search over the composite space of multiple indices, driven by functional optimization for providing tight heuristics. By processing queries as optimization, OPT* significantly outperforms baseline approaches, with up to 3 orders of magnitude margins.","PeriodicalId":416090,"journal":{"name":"Proceedings of the 2006 ACM SIGMOD international conference on Management of data","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122259614","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Energy-efficient monitoring of extreme values in sensor networks","authors":"Adam Silberstein, Kamesh Munagala, Jun Yang","doi":"10.1145/1142473.1142493","DOIUrl":"https://doi.org/10.1145/1142473.1142493","url":null,"abstract":"Monitoring extreme values (MAX or MIN) is a fundamental problem in wireless sensor networks (and in general, complex dynamic systems). This problem presents very different algorithmic challenges from aggregate and selection queries, in the sense that an individual node cannot by itself determine its inclusion in the query result. We present novel query processing algorithms for this problem, with the goal of minimizing message traffic in the network. These algorithms employ a hierarchy of local constraints, or thresholds, to leverage network topology such that message-passing is localized. We evaluate all algorithms using simulated and real-world data to study various trade-offs.","PeriodicalId":416090,"journal":{"name":"Proceedings of the 2006 ACM SIGMOD international conference on Management of data","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128772052","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Using SPIDER: an experience report","authors":"Nick Koudas, A. Marathe, D. Srivastava","doi":"10.1145/1142473.1142557","DOIUrl":"https://doi.org/10.1145/1142473.1142557","url":null,"abstract":"At AT&T Labs-Research, we have been developing a prototype system called SPIDER to efficiently support flexible string matching of attribute values in large databases. SPIDER has been used in AT&T, both as a key component of an operational portal for matching customer names and addresses, and for a variety of ad hoc data quality analyses. In this talk, we report on experiences with SPIDER.","PeriodicalId":416090,"journal":{"name":"Proceedings of the 2006 ACM SIGMOD international conference on Management of data","volume":"91 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115746674","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On the database/network interface in large-scale publish/subscribe systems","authors":"B. Chandramouli, Junyi Xie, Jun Yang","doi":"10.1145/1142473.1142539","DOIUrl":"https://doi.org/10.1145/1142473.1142539","url":null,"abstract":"The work performed by a publish/subscribe system can conceptually be divided into subscription processing and notification dissemination. Traditionally, research in the database and networking communities has focused on these aspects in isolation. The interface between the database server and the network is often overlooked by previous research. At one extreme, database servers are directly responsible for notifying individual subscribers; at the other extreme, updates are injected directly into the network, and the network is solely responsible for processing subscriptions and forwarding notifications. These extremes are unsuitable for complex and stateful subscription queries. A primary goal of this paper is to explore the design space between the two extremes, and to devise solutions that incorporate both database-side and network-side considerations in order to reduce the communication and server load and maintain system scalability. Our techniques apply to a broad range of stateful query types, and we present solutions for several of them. Our detailed experiments based on real and synthetic workloads with varying characteristics and link-level network simulation show that by exploiting the query semantics and building an appropriate interface between the database and the network, it is possible to achieve orders-of-magnitude savings in network traffic at low server-side processing cost.","PeriodicalId":416090,"journal":{"name":"Proceedings of the 2006 ACM SIGMOD international conference on Management of data","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128365211","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Documentum ECI self-repairing wrappers: performance analysis","authors":"Boris Chidlovskii, Bruno Roustant, Marc Brette","doi":"10.1145/1142473.1142555","DOIUrl":"https://doi.org/10.1145/1142473.1142555","url":null,"abstract":"Documentum Enterprise Content Integration (ECI) services is a content integration middleware that provides one-query access to the Intranet and Internet content resources. The ECI Adapter technology offers an interface to any application for data and metadata extraction from unstructured Web pages. It offers a unique framework of wrapper production, automatic recovery and maintenance, developed at Xerox Research Centre Europe and based on state-of-the-art algorithms from machine learning and grammatical inference. In this presentation we analyze the performance of ECI adapters deployed in current commercial installations. We benefit from accessing reports on daily tests for all ECI commercially deployed adapters collected from June 2003 to September 2005. Using the daily reports, we analyze different aspects of the wrapper technology.","PeriodicalId":416090,"journal":{"name":"Proceedings of the 2006 ACM SIGMOD international conference on Management of data","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128465319","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A non-linear dimensionality-reduction technique for fast similarity search in large databases","authors":"Khanh Vu, K. Hua, Hao Cheng, S. Lang","doi":"10.1145/1142473.1142532","DOIUrl":"https://doi.org/10.1145/1142473.1142532","url":null,"abstract":"To enable efficient similarity search in large databases, many indexing techniques use a linear transformation scheme to reduce dimensions and allow fast approximation. In this reduction approach the approximation is unbounded, so that the approximation volume extends across the dataspace. This causes over-estimation of retrieval sets and impairs performance. This paper presents a non-linear transformation scheme that extracts two important parameters specifying the data. We prove that these parameters correspond to a bounded volume around the search sphere, irrespective of dimensionality. We use a special workspace-mapping mechanism to derive tight bounds for the parameters and to prove further results, as well as highlighting insights into the problems and our proposed solutions. We formulate a measure that lower-bounds the Euclidean distance, and discuss the implementation of the technique upon a popular index structure. Extensive experiments confirm the superiority of this technique over recent state-of-the-art schemes.","PeriodicalId":416090,"journal":{"name":"Proceedings of the 2006 ACM SIGMOD international conference on Management of data","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121666550","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Data management projects at Google","authors":"Wilson C. Hsieh, J. Madhavan, Robin C. Pike","doi":"10.1145/1142473.1142566","DOIUrl":"https://doi.org/10.1145/1142473.1142566","url":null,"abstract":"This session describes three data management projects at Google. BigTable is a highly scalable system for distributed storage and querying of structured data. Sawzall is a system for large-scale analysis of data sets that have a flat but regular structure. Finally, GoogleBase is a system for storing and searching structured data contributed by external parties.","PeriodicalId":416090,"journal":{"name":"Proceedings of the 2006 ACM SIGMOD international conference on Management of data","volume":"38 5","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"113938872","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Finding k-dominant skylines in high dimensional space","authors":"Chee-Yong Chan, H. Jagadish, K. Tan, A. Tung, Zhenjie Zhang","doi":"10.1145/1142473.1142530","DOIUrl":"https://doi.org/10.1145/1142473.1142530","url":null,"abstract":"Given a d-dimensional data set, a point p dominates another point q if it is better than or equal to q in all dimensions and better than q in at least one dimension. A point is a skyline point if there does not exist any point that can dominate it. Skyline queries, which return skyline points, are useful in many decision making applications. Unfortunately, as the number of dimensions increases, the chance of one point dominating another point is very low. As such, the number of skyline points becomes too numerous to offer any interesting insights. To find more important and meaningful skyline points in high dimensional space, we propose a new concept, called k-dominant skyline which relaxes the idea of dominance to k-dominance. A point p is said to k-dominate another point q if there are k ≤ d dimensions in which p is better than or equal to q and is better in at least one of these k dimensions. A point that is not k-dominated by any other points is in the k-dominant skyline. We prove various properties of k-dominant skyline. In particular, because k-dominant skyline points are not transitive, existing skyline algorithms cannot be adapted for k-dominant skyline. We then present several new algorithms for finding k-dominant skyline and its variants. Extensive experiments show that our methods can answer different queries on both synthetic and real data sets efficiently.","PeriodicalId":416090,"journal":{"name":"Proceedings of the 2006 ACM SIGMOD international conference on Management of data","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114558117","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Ordering the attributes of query results","authors":"Gautam Das, Vagelis Hristidis, Nishant Kapoor, S. Sudarshan","doi":"10.1145/1142473.1142518","DOIUrl":"https://doi.org/10.1145/1142473.1142518","url":null,"abstract":"There has been a great deal of interest in the past few years on ranking of results of queries on structured databases, including work on probabilistic information retrieval, rank aggregation, and algorithms for merging of ordered lists. In many applications, for example sales of homes, used cars or electronic goods, data items have a very large number of attributes. When displaying a (ranked) list of items to users, only a few attributes can be shown. Traditionally, these are selected manually. We argue that automatic selection of attributes is required to deal with different requirements of different users. We formulate the problem as an optimization problem of choosing the most \"useful\" set of attributes, that is, the attributes that are most influential in the ranking of the items. We discuss different variants of our notion of attribute usefulness, and propose a hybrid Split-Pane approach that returns a composite of the top attributes of each variant. We conduct both a performance and a user study illustrating the benefits of our algorithms in terms of efficiency and quality of explanation.","PeriodicalId":416090,"journal":{"name":"Proceedings of the 2006 ACM SIGMOD international conference on Management of data","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121711818","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}