Proceedings of the 33rd ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems: latest papers (page 2)

Is min-wise hashing optimal for summarizing set intersection?

Proceedings of the 33rd ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems Pub Date : 2014-06-18 DOI: 10.1145/2594538.2594554

R. Pagh, Morten Stöckel, David P. Woodruff

Abstract: Min-wise hashing is an important method for estimating the size of the intersection of sets, based on a succinct summary (a "min-hash") of each set. One application is estimation of the number of data points that satisfy the conjunction of m ≥ 2 simple predicates, where a min-hash is available for the set of points satisfying each predicate. This has application in query optimization and for approximate computation of COUNT aggregates. In this paper we address the question: How many bits is it necessary to allocate to each summary in order to get an estimate with (1 ± ε)-relative error? The state-of-the-art technique for minimizing the encoding size, for any desired estimation error, is b-bit min-wise hashing due to Li and König (Communications of the ACM, 2011). We give new lower and upper bounds: Using information complexity arguments, we show that b-bit min-wise hashing is space optimal for m = 2 predicates in the sense that the estimator's variance is within a constant factor of the smallest possible among all summaries with the given space usage. But for conjunctions of m > 2 predicates we show that the performance of b-bit min-wise hashing (and more generally any method based on "k-permutation" min-hash) deteriorates as m grows. We describe a new summary that nearly matches our lower bound for m ≥ 2. It asymptotically outperforms all k-permutation schemes (by around a factor Ω(m / log m)), as well as methods based on subsampling (by a factor Ω(log n_max), where n_max is the maximum set size).
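
For readers unfamiliar with the baseline this abstract refers to, the following is a minimal sketch of plain k-permutation min-hash for m = 2 sets. It is not the paper's new summary and not the b-bit variant of Li and König; the function names, the use of salted BLAKE2b as a stand-in for random permutations, and the choice of k are all assumptions made for illustration. The estimator uses the identity |A ∩ B| = J · (|A| + |B|) / (1 + J), where J is the Jaccard similarity estimated as the fraction of matching signature positions.

```python
import hashlib


def _hash64(item, seed):
    """64-bit hash of item under the seed-th hash function (salted BLAKE2b)."""
    digest = hashlib.blake2b(
        repr(item).encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "big"),
    ).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(items, k):
    """k-permutation min-hash of a non-empty collection: the minimum
    hash value under each of k hash functions."""
    return [min(_hash64(x, seed) for x in items) for seed in range(k)]


def estimate_intersection(sig_a, size_a, sig_b, size_b):
    """Estimate |A ∩ B| from two equal-length signatures and the set sizes."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    jaccard = matches / len(sig_a)
    # |A ∩ B| = J * |A ∪ B| and |A ∪ B| = |A| + |B| - |A ∩ B|,
    # hence |A ∩ B| = J * (|A| + |B|) / (1 + J).
    return jaccard * (size_a + size_b) / (1 + jaccard)


if __name__ == "__main__":
    a = set(range(1000))
    b = set(range(500, 1500))  # true intersection size is 500
    sig_a = minhash_signature(a, 128)
    sig_b = minhash_signature(b, 128)
    print(estimate_intersection(sig_a, len(a), sig_b, len(b)))
```

Each signature here stores k full 64-bit values; the space question posed in the abstract is precisely how far such a summary can be compressed for a given relative error.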

Citations: 38

Database principles in information extraction

Proceedings of the 33rd ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems Pub Date : 2014-06-18 DOI: 10.1145/2594538.2594563

B. Kimelfeld

Citations: 10

A dichotomy for non-repeating queries with negation in probabilistic databases

Proceedings of the 33rd ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems Pub Date : 2014-06-18 DOI: 10.1145/2594538.2594549

Robert Fink, Dan Olteanu

Citations: 10

Session details: Enumeration, counting, and probabilities

Proceedings of the 33rd ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems Pub Date : 2014-06-18 DOI: 10.1145/3255785

Dan Suciu

Citations: 0

Independent range sampling

Proceedings of the 33rd ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems Pub Date : 2014-06-18 DOI: 10.1145/2594538.2594545

Xiaocheng Hu, Miao Qiao, Yufei Tao

Abstract: This paper studies the independent range sampling problem. The input is a set P of n points in R. Given an interval q = [x, y] and an integer t ≥ 1, a query returns t elements uniformly sampled (with/without replacement) from P ∩ q. The sampling result must be independent from those returned by the previous queries. The objective is to store P in a structure for answering all queries efficiently. If P fits in memory, the problem is interesting when P is dynamic (i.e., allowing insertions and deletions). The state of the art is a structure of O(n) space that answers a query in O(t log n) time, and supports an update in O(log n) time. We describe a new structure of O(n) space that answers a query in O(log n + t) expected time, and supports an update in O(log n) time. If P does not fit in memory, the problem is challenging even when P is static. The best known structure incurs O(log_B n + t) I/Os per query, where B is the block size. We develop a new structure of O(n/B) space that answers a query in O(log*(n/B) + log_B n + (t/B) log_{M/B}(n/B)) amortized expected I/Os, where M is the memory size, and log*(n/B) is the number of iterative log_2(·) operations we need to perform on n/B before going below a constant. We also give a lower bound argument showing that this is nearly optimal; in particular, the multiplicative term log_{M/B}(n/B) is necessary.
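
To make the query definition concrete, here is a minimal sketch of the easy case only: P is static, fits in memory, and samples are drawn with replacement. It is not the paper's dynamic or external-memory structure. The function name `range_sample` and its parameters are assumptions for illustration. With P kept in a sorted list, two binary searches locate P ∩ q in O(log n) time, and each of the t samples is then one uniform index draw, for O(log n + t) in total.

```python
import bisect
import random


def range_sample(sorted_points, x, y, t, rng=random):
    """Return t points drawn uniformly, with replacement, from the
    points of sorted_points that lie in the closed interval [x, y].

    sorted_points must be sorted in ascending order. Returns an empty
    list when no point falls in the interval.
    """
    lo = bisect.bisect_left(sorted_points, x)   # first index with value >= x
    hi = bisect.bisect_right(sorted_points, y)  # first index with value > y
    if lo >= hi:
        return []
    # Every draw uses fresh randomness, so results are independent
    # of those returned by earlier queries.
    return [sorted_points[rng.randrange(lo, hi)] for _ in range(t)]


if __name__ == "__main__":
    points = sorted([0.5, 1.2, 3.3, 4.8, 7.1, 9.9])
    print(range_sample(points, 1.0, 5.0, 4))
```

Supporting insertions and deletions is what breaks this approach, since a sorted array costs O(n) per update; that dynamic setting is the one the abstract identifies as interesting.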

Citations: 35

Nested dependencies: structure and reasoning

Proceedings of the 33rd ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems Pub Date : 2014-06-18 DOI: 10.1145/2594538.2594544

Phokion G. Kolaitis, R. Pichler, Emanuel Sallinger, V. Savenkov

Abstract: During the past decade, schema mappings have been extensively used in formalizing and studying such critical data interoperability tasks as data exchange and data integration. Much of the work has focused on GLAV mappings, i.e., schema mappings specified by source-to-target tuple-generating dependencies (s-t tgds), and on schema mappings specified by second-order tgds (SO tgds), which constitute the closure of GLAV mappings under composition. In addition, nested GLAV mappings have also been considered, i.e., schema mappings specified by nested tgds, which have expressive power intermediate between s-t tgds and SO tgds. Even though nested GLAV mappings have been used in data exchange systems, such as IBM's Clio, no systematic investigation of this class of schema mappings has been carried out so far. In this paper, we embark on such an investigation by focusing on the basic reasoning tasks, algorithmic problems, and structural properties of nested GLAV mappings. One of our main results is the decidability of the implication problem for nested tgds. We also analyze the structure of the core of universal solutions with respect to nested GLAV mappings and develop useful tools for telling apart SO tgds from nested tgds. By discovering deeper structural properties of nested GLAV mappings, we also show that the following problem is decidable: given a nested GLAV mapping, is it logically equivalent to a GLAV mapping?

Citations: 16

Cleaning inconsistencies in information extraction via prioritized repairs

Proceedings of the 33rd ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems Pub Date : 2014-06-18 DOI: 10.1145/2594538.2594540

Ronald Fagin, B. Kimelfeld, Frederick Reiss, Stijn Vansummeren

Abstract: The population of a predefined relational schema from textual content, commonly known as Information Extraction (IE), is a pervasive task in contemporary computational challenges associated with Big Data. Since the textual content varies widely in nature and structure (from machine logs to informal natural language), it is notoriously difficult to write IE programs that extract the sought information without any inconsistencies (e.g., a substring should not be annotated as both an address and a person name). Dealing with inconsistencies is hence of crucial importance in IE systems. Industrial-strength IE systems like GATE and IBM SystemT therefore provide a built-in collection of cleaning operations to remove inconsistencies from extracted relations. These operations, however, are collected in an ad-hoc fashion through use cases. Ideally, we would like to allow IE developers to declare their own policies. But existing cleaning operations are defined in an algorithmic way and, hence, it is not clear how to extend the built-in operations without requiring low-level coding of internal or external functions. We embark on the establishment of a framework for declarative cleaning of inconsistencies in IE, through principles of database theory. Specifically, building upon the formalism of document spanners for IE, we adopt the concept of prioritized repairs, which has been recently proposed as an extension of the traditional database repairs to incorporate priorities among conflicting facts. We show that our framework captures the popular cleaning policies, as well as the POSIX semantics for extraction through regular expressions. We explore the problem of determining whether a cleaning declaration is unambiguous (i.e., always results in a single repair), and whether it increases the expressive power of the extraction language. We give both positive and negative results, some of which are general, and some of which apply to policies used in practice.

Citations: 28

Session details: Tutorial 2

Proceedings of the 33rd ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems Pub Date : 2014-06-18 DOI: 10.1145/3255786

P. Barceló

Citations: 0

On scale independence for querying big data

Proceedings of the 33rd ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems Pub Date : 2014-06-18 DOI: 10.1145/2594538.2594551

W. Fan, Floris Geerts, L. Libkin

Abstract: To make query answering feasible in big datasets, practitioners have been looking into the notion of scale independence of queries. Intuitively, such queries require only a relatively small subset of the data, whose size is determined by the query and access methods rather than the size of the dataset itself. This paper aims to formalize this notion and study its properties. We start by defining what it means to be scale-independent, and provide matching upper and lower bounds for checking scale independence, for queries in various languages, and for combined and data complexity. Since the complexity turns out to be rather high, and since scale-independent queries cannot be captured syntactically, we develop sufficient conditions for scale independence. We formulate them based on access schemas, which combine indexing and constraints together with bounds on the sizes of retrieved data sets. We then study two variations of scale-independent query answering, inspired by existing practical systems. One concerns incremental query answering: we check when query answers can be maintained in response to updates scale-independently. The other explores scale-independent query rewriting using views.

Citations: 54

Cost-oblivious storage reallocation

Proceedings of the 33rd ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems Pub Date : 2014-06-18 DOI: 10.1145/2594538.2594548

M. A. Bender, Martín Farach-Colton, S. Fekete, Jeremy T. Fineman, Seth Gilbert

Abstract: Databases allocate and free blocks of storage on disk. Freed blocks introduce holes where no data is stored. Allocation systems attempt to reuse such deallocated regions in order to minimize the footprint on disk. When previously allocated blocks cannot be moved, this problem is called the memory allocation problem. It is known to have a logarithmic overhead in the footprint size. This paper defines the storage reallocation problem, where previously allocated blocks can be moved, or reallocated, but at some cost. This cost is determined by the allocation/reallocation cost function. The algorithms presented here are cost oblivious, in that they work for a broad and reasonable class of cost functions, even when they do not know what the cost function actually is. The objective is to minimize the storage footprint, that is, the largest memory address containing an allocated object, while simultaneously minimizing the reallocation costs. This paper gives asymptotically optimal algorithms for storage reallocation, in which the storage footprint is at most (1+ε) times optimal, and the reallocation cost is at most O((1/ε)log(1/ε)) times the original allocation cost, which is asymptotically optimal for constant ε. The algorithms are cost oblivious, which means they achieve these bounds with no knowledge of the allocation/reallocation cost function, as long as the cost function is subadditive.

Citations: 12