Latest Articles from ACM Transactions on Database Systems

Incremental Graph Computations: Doable and Undoable
IF 1.8, Zone 2, Computer Science
ACM Transactions on Database Systems Pub Date : 2022-05-23 DOI: 10.1145/3500930
Wenfei Fan, Chao Tian
{"title":"Incremental Graph Computations: Doable and Undoable","authors":"Wenfei Fan, Chao Tian","doi":"https://dl.acm.org/doi/full/10.1145/3500930","DOIUrl":"https://doi.org/https://dl.acm.org/doi/full/10.1145/3500930","url":null,"abstract":"<p>The incremental problem for a class ( {mathcal {Q}} ) of graph queries aims to compute, given a query ( Q in {mathcal {Q}} ), graph <i>G</i>, answers <i>Q</i>(<i>G</i>) to <i>Q</i> in <i>G</i> and updates <i>ΔG</i> to <i>G</i> as input, changes <i>ΔO</i> to output <i>Q</i>(<i>G</i>) such that <i>Q</i>(<i>G</i>⊕<i>ΔG</i>) = <i>Q</i>(<i>G</i>)⊕<i>ΔO</i>. It is called <i>bounded</i> if its cost can be expressed as a polynomial function in the sizes of <i>Q</i>, <i>ΔG</i> and <i>ΔO</i>, which reduces the computations on possibly big <i>G</i> to small <i>ΔG</i> and <i>ΔO</i>. No matter how desirable, however, our first results are negative: For common graph queries such as traversal, connectivity, keyword search, pattern matching, and maximum cardinality matching, their incremental problems are unbounded. </p><p>In light of the negative results, we propose two characterizations for the effectiveness of incremental graph computation: (a) <i>localizable</i>, if its cost is decided by small neighbors of nodes in <i>ΔG</i> instead of the entire <i>G</i>; and (b) <i>bounded relative to</i> a batch graph algorithm ( {mathcal {T}} ), if the cost is determined by the sizes of <i>ΔG</i> and changes to the affected area that is necessarily checked by any algorithms that incrementalize ( {mathcal {T}} ). We show that the incremental computations above are either localizable or relatively bounded by providing corresponding incremental algorithms. That is, we can either reduce the incremental computations on big graphs to small data, or incrementalize existing batch graph algorithms by minimizing unnecessary recomputation. Using real-life and synthetic data, we experimentally verify the effectiveness of our incremental algorithms.</p>","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":null,"pages":null},"PeriodicalIF":1.8,"publicationDate":"2022-05-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138508982","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
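As an illustration of the incremental contract described in the abstract, the following minimal Python sketch (not the paper's algorithm; the query, data layout, and class name are invented for this example) maintains Q(G) = the number of connected components under edge insertions ΔG, returning the change ΔO instead of recomputing over all of G.

```python
# Minimal sketch: maintain Q(G) = number of connected components under edge
# insertions ΔG, returning the change ΔO rather than rescanning all of G.
class IncrementalConnectivity:
    def __init__(self, num_nodes):
        self.parent = list(range(num_nodes))
        self.components = num_nodes               # current answer Q(G)

    def _find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]   # path halving
            x = self.parent[x]
        return x

    def insert_edges(self, delta_edges):
        """Apply ΔG (a list of new edges) and return ΔO, the change to Q(G)."""
        delta_out = 0
        for u, v in delta_edges:
            ru, rv = self._find(u), self._find(v)
            if ru != rv:                          # the edge merges two components
                self.parent[ru] = rv
                self.components -= 1
                delta_out -= 1
        return delta_out

g = IncrementalConnectivity(5)                    # five isolated nodes, Q(G) = 5
print(g.insert_edges([(0, 1), (1, 2)]))           # ΔO = -2, Q(G ⊕ ΔG) is now 3
print(g.insert_edges([(0, 2)]))                   # ΔO = 0, the nodes were already connected
```

Here Q(G ⊕ ΔG) = Q(G) ⊕ ΔO, with ⊕ read as edge insertion on the graph side and integer addition on the answer side; each update is processed without rescanning the whole graph.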
Embedded Functional Dependencies and Data-completeness Tailored Database Design
IF 1.8, Zone 2, Computer Science
ACM Transactions on Database Systems Pub Date : 2021-05-30 DOI: 10.1145/3450518
Ziheng Wei, Sebastian Link
{"title":"Embedded Functional Dependencies and Data-completeness Tailored Database Design","authors":"Ziheng Wei, Sebastian Link","doi":"10.1145/3450518","DOIUrl":"https://doi.org/10.1145/3450518","url":null,"abstract":"We establish a principled schema design framework for data with missing values. The framework is based on the new notion of an embedded functional dependency, which is independent of the interpretation of missing values, able to express completeness and integrity requirements on application data, and capable of capturing redundant data value occurrences that may cause problems with processing data that meets the requirements. We establish axiomatic, algorithmic, and logical foundations for reasoning about embedded functional dependencies. These foundations enable us to introduce generalizations of Boyce-Codd and Third normal forms that avoid processing difficulties of any application data, or minimize these difficulties across dependency-preserving decompositions, respectively. We show how to transform any given schema into application schemata that meet given completeness and integrity requirements, and the conditions of the generalized normal forms. Data over those application schemata are therefore fit for purpose by design. Extensive experiments with benchmark schemata and data illustrate the effectiveness of our framework for the acquisition of the constraints, the schema design process, and the performance of the schema designs in terms of updates and join queries.","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":null,"pages":null},"PeriodicalIF":1.8,"publicationDate":"2021-05-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138530896","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
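The following small check illustrates one reading of an embedded functional dependency. This is a hedged sketch: the relation, attribute names, and the semantics used here (the FD X → Y is evaluated only on tuples that carry no missing value on the embedded attribute set E) are assumptions for illustration, not the paper's formal definition.

```python
# Illustrative check of an embedded FD (E, X -> Y) with X, Y ⊆ E:
# the FD must hold on the subrelation of tuples that are complete on E.
MISSING = None

def holds_embedded_fd(rows, E, X, Y):
    """rows: list of dicts; E, X, Y: lists of attribute names."""
    seen = {}
    for row in rows:
        if any(row.get(a) is MISSING for a in E):
            continue                              # tuple not complete on E: ignored
        lhs = tuple(row[a] for a in X)
        rhs = tuple(row[a] for a in Y)
        if seen.setdefault(lhs, rhs) != rhs:
            return False                          # two E-complete tuples violate X -> Y
    return True

employees = [
    {"emp": "a", "dept": "d1", "manager": "m1"},
    {"emp": "b", "dept": "d1", "manager": "m1"},
    {"emp": "c", "dept": "d2", "manager": MISSING},   # incomplete on E: skipped
]
print(holds_embedded_fd(employees, ["dept", "manager"], ["dept"], ["manager"]))  # True
```

Tuples that are incomplete on E are skipped, so the check does not depend on how missing values are interpreted.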
Constant-Delay Enumeration for Nondeterministic Document Spanners
IF 1.8, Zone 2, Computer Science
ACM Transactions on Database Systems Pub Date : 2021-04-14 DOI: 10.1145/3436487
Antoine Amarilli, Pierre Bourhis, Stefan Mengel, Matthias Niewerth
{"title":"Constant-Delay Enumeration for Nondeterministic Document Spanners","authors":"Antoine Amarilli, Pierre Bourhis, Stefan Mengel, Matthias Niewerth","doi":"10.1145/3436487","DOIUrl":"https://doi.org/10.1145/3436487","url":null,"abstract":"We consider the information extraction framework known as <jats:italic>document spanners</jats:italic> and study the problem of efficiently computing the results of the extraction from an input document, where the extraction task is described as a sequential <jats:italic>variable-set automaton</jats:italic> (VA). We pose this problem in the setting of enumeration algorithms, where we can first run a preprocessing phase and must then produce the results with a small delay between any two consecutive results. Our goal is to have an algorithm that is tractable in combined complexity, i.e., in the sizes of the input document and the VA, while ensuring the best possible data complexity bounds in the input document size, i.e., constant delay in the document size. Several recent works at PODS’18 proposed such algorithms but with linear delay in the document size or with an exponential dependency in size of the (generally nondeterministic) input VA. In particular, Florenzano et al. suggest that our desired runtime guarantees cannot be met for general sequential VAs. We refute this and show that, given a nondeterministic sequential VA and an input document, we can enumerate the mappings of the VA on the document with the following bounds: the preprocessing is linear in the document size and polynomial in the size of the VA, and the delay is independent of the document and polynomial in the size of the VA. The resulting algorithm thus achieves tractability in combined complexity and the best possible data complexity bounds. Moreover, it is rather easy to describe, particularly for the restricted case of so-called extended VAs. Finally, we evaluate our algorithm empirically using a prototype implementation.","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":null,"pages":null},"PeriodicalIF":1.8,"publicationDate":"2021-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138530897","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
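The complexity contract in the abstract, a preprocessing phase followed by enumeration with delay independent of the document, can be pictured with a toy task. This is not the paper's spanner algorithm; the pattern "all spans starting with the letter a" is an invented stand-in. The point is that preprocessing runs in one linear pass, yet the enumerator may emit far more answers than the preprocessing ever materializes, with constant work between consecutive answers.

```python
# Toy illustration of the two-phase contract: linear preprocessing over the
# document, then constant delay between consecutive answers.

def preprocess(document):
    """O(|document|): record the positions where an answer span may start."""
    return [i for i, ch in enumerate(document) if ch == "a"], len(document)

def enumerate_spans(starts, doc_len):
    """Yield every span (start, end) that begins with 'a'; O(1) work per yield."""
    for start in starts:
        for end in range(start + 1, doc_len + 1):
            yield (start, end)

starts, n = preprocess("abca")
print(list(enumerate_spans(starts, n)))
# [(0, 1), (0, 2), (0, 3), (0, 4), (3, 4)]
```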
Functional Aggregate Queries with Additive Inequalities
IF 1.8, Zone 2, Computer Science
ACM Transactions on Database Systems Pub Date : 2020-12-06 DOI: 10.1145/3426865
Mahmoud Abo Khamis, Ryan R. Curtin, Benjamin Moseley, Hung Q. Ngo, XuanLong Nguyen, Dan Olteanu, Maximilian Schleich
{"title":"Functional Aggregate Queries with Additive Inequalities","authors":"KhamisMahmoud Abo, R. CurtinRyan, MoseleyBenjamin, Q. NgoHung, NguyenXuanlong, OlteanuDan, SchleichMaximilian","doi":"10.1145/3426865","DOIUrl":"https://doi.org/10.1145/3426865","url":null,"abstract":"Motivated by fundamental applications in databases and relational machine learning, we formulate and study the problem of answering functional aggregate queries (FAQ) in which some of the input fac...","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":null,"pages":null},"PeriodicalIF":1.8,"publicationDate":"2020-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88257893","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 10
Efficient Sorting, Duplicate Removal, Grouping, and Aggregation
IF 1.8, Zone 2, Computer Science
ACM Transactions on Database Systems Pub Date : 2020-10-01 DOI: 10.1145/3568027
Thanh Do, G. Graefe, J. Naughton
{"title":"Efficient Sorting, Duplicate Removal, Grouping, and Aggregation","authors":"Thanh Do, G. Graefe, J. Naughton","doi":"10.1145/3568027","DOIUrl":"https://doi.org/10.1145/3568027","url":null,"abstract":"Database query processing requires algorithms for duplicate removal, grouping, and aggregation. Three algorithms exist: in-stream aggregation is most efficient by far but requires sorted input; sort-based aggregation relies on external merge sort; and hash aggregation relies on an in-memory hash table plus hash partitioning to temporary storage. Cost-based query optimization chooses which algorithm to use based on several factors, including the sort order of the input, input and output sizes, and the need for sorted output. For example, hash-based aggregation is ideal for output smaller than the available memory (e.g., Query 1 of TPC-H), whereas sorting the entire input and aggregating after sorting are preferable when both aggregation input and output are large and the output needs to be sorted for a subsequent operation such as a merge join. Unfortunately, the size information required for a sound choice is often inaccurate or unavailable during query optimization, leading to sub-optimal algorithm choices. In response, this article introduces a new algorithm for sort-based duplicate removal, grouping, and aggregation. The new algorithm always performs at least as well as both traditional hash-based and traditional sort-based algorithms. It can serve as a system’s only aggregation algorithm for unsorted inputs, thus preventing erroneous algorithm choices. Furthermore, the new algorithm produces sorted output that can speed up subsequent operations. Google’s F1 Query uses the new algorithm in production workloads that aggregate petabytes of data every day.","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":null,"pages":null},"PeriodicalIF":1.8,"publicationDate":"2020-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43440262","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 6
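For readers unfamiliar with the two traditional strategies the abstract compares, the following sketch (a simplification in Python; it is not the article's new algorithm and not F1 Query code) contrasts in-memory hash aggregation with sort-based aggregation, whose output comes out sorted.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter

def hash_aggregate(rows):
    """In-memory hash aggregation: ideal when the output fits in memory."""
    sums = defaultdict(int)
    for key, value in rows:
        sums[key] += value
    return dict(sums)

def sort_aggregate(rows):
    """Sort-based aggregation: sort, then aggregate in one pass over the stream;
    the output is sorted, which can help a later operation such as a merge join."""
    result = []
    for key, group in groupby(sorted(rows, key=itemgetter(0)), key=itemgetter(0)):
        result.append((key, sum(value for _, value in group)))
    return result

rows = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
print(hash_aggregate(rows))   # {'a': 4, 'b': 7, 'c': 4}
print(sort_aggregate(rows))   # [('a', 4), ('b', 7), ('c', 4)]  <- sorted output
```

The article's contribution is a sort-based algorithm that matches the better of the two regardless of input and output sizes; the sketch only shows the baseline trade-off.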
Conjunctive Queries: Unique Characterizations and Exact Learnability
IF 1.8, Zone 2, Computer Science
ACM Transactions on Database Systems Pub Date : 2020-08-16 DOI: 10.1145/3559756
B. ten Cate, V. Dalmau
{"title":"Conjunctive Queries: Unique Characterizations and Exact Learnability","authors":"B. T. Cate, V. Dalmau","doi":"10.1145/3559756","DOIUrl":"https://doi.org/10.1145/3559756","url":null,"abstract":"We answer the question of which conjunctive queries are uniquely characterized by polynomially many positive and negative examples and how to construct such examples efficiently. As a consequence, we obtain a new efficient exact learning algorithm for a class of conjunctive queries. At the core of our contributions lie two new polynomial-time algorithms for constructing frontiers in the homomorphism lattice of finite structures. We also discuss implications for the unique characterizability and learnability of schema mappings and of description logic concepts.","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":null,"pages":null},"PeriodicalIF":1.8,"publicationDate":"2020-08-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"64060120","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 20
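To make the role of positive and negative examples concrete, the following brute-force sketch (illustration only; the relation names and data are invented, and the paper's contribution concerns characterizing and learning queries, not evaluating them) tests whether a Boolean conjunctive query, viewed through its atoms, has a homomorphism into a data example. A structure is a positive example exactly when such a homomorphism exists.

```python
# Brute-force sketch: a Boolean conjunctive query holds in a data example
# iff there is a homomorphism mapping its variables into the example's facts.
from itertools import product

def satisfies(query_atoms, facts):
    """query_atoms / facts: sets of tuples like ('R', 'x', 'y') / ('R', 1, 2)."""
    variables = sorted({v for _, *args in query_atoms for v in args})
    domain = sorted({c for _, *args in facts for c in args})
    for image in product(domain, repeat=len(variables)):
        h = dict(zip(variables, image))
        if all((rel, *[h[a] for a in args]) in facts
               for rel, *args in query_atoms):
            return True                      # homomorphism found
    return False

# Q(): ∃x ∃y ∃z  R(x, y) ∧ R(y, z)   -- a directed path of length two
query = {("R", "x", "y"), ("R", "y", "z")}
positive = {("R", 1, 2), ("R", 2, 3)}
negative = {("R", 1, 2), ("R", 3, 4)}
print(satisfies(query, positive))   # True
print(satisfies(query, negative))   # False
```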
Efficient Enumeration Algorithms for Regular Document Spanners
IF 1.8, Zone 2, Computer Science
ACM Transactions on Database Systems Pub Date : 2020-02-08 DOI: 10.1145/3351451
Fernando Florenzano, Cristian Riveros, Martín Ugarte, Stijn Vansummeren, Domagoj Vrgoč
{"title":"Efficient Enumeration Algorithms for Regular Document Spanners","authors":"FlorenzanoFernando, RiverosCristian, UgarteMartín, VansummerenStijn, VrgočDomagoj","doi":"10.1145/3351451","DOIUrl":"https://doi.org/10.1145/3351451","url":null,"abstract":"Regular expressions and automata models with capture variables are core tools in rule-based information extraction. These formalisms, also called regular document spanners, use regular languages to...","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":null,"pages":null},"PeriodicalIF":1.8,"publicationDate":"2020-02-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1145/3351451","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"64021639","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 21
Distributed Joins and Data Placement for Minimal Network Traffic
IF 1.8, Zone 2, Computer Science
ACM Transactions on Database Systems Pub Date : 2018-11-26 DOI: 10.1145/3241039
Orestis Polychroniou, Wangda Zhang, K. A. Ross
{"title":"Distributed Joins and Data Placement for Minimal Network Traffic","authors":"Orestis Polychroniou, Wangda Zhang, K. A. Ross","doi":"10.1145/3241039","DOIUrl":"https://doi.org/10.1145/3241039","url":null,"abstract":"Network communication is the slowest component of many operators in distributed parallel databases deployed for large-scale analytics. Whereas considerable work has focused on speeding up databases on modern hardware, communication reduction has received less attention. Existing parallel DBMSs rely on algorithms designed for disks with minor modifications for networks. A more complicated algorithm may burden the CPUs but could avoid redundant transfers of tuples across the network. We introduce track join, a new distributed join algorithm that minimizes network traffic by generating an optimal transfer schedule for each distinct join key. Track join extends the trade-off options between CPU and network. Track join explicitly detects and exploits locality, also allowing for advanced placement of tuples beyond hash partitioning on a single attribute. We propose a novel data placement algorithm based on track join that minimizes the total network cost of multiple joins across different dimensions in an analytical workload. Our evaluation shows that track join outperforms hash join on the most expensive queries of real workloads regarding both network traffic and execution time. Finally, we show that our data placement optimization approach is both robust and effective in minimizing the total network cost of joins in analytical workloads.","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":null,"pages":null},"PeriodicalIF":1.8,"publicationDate":"2018-11-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82766196","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 8
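The following simplified sketch conveys the per-key scheduling idea behind track join. It is a hedged approximation, not the paper's algorithm: it greedily ships each key's tuples to the node that already holds most of them, whereas the paper optimizes the per-key transfer schedule more carefully. Node and key names are invented.

```python
# Simplified per-key transfer scheduling: network cost is decided per distinct
# join key rather than fixed by hash partitioning on one attribute.
from collections import Counter

def per_key_schedule(r_locations, s_locations):
    """r_locations, s_locations: {join_key: Counter({node: tuple_count})}."""
    schedule, network_cost = {}, 0
    for key in sorted(set(r_locations) | set(s_locations)):
        counts = Counter(r_locations.get(key, {})) + Counter(s_locations.get(key, {}))
        destination, local = counts.most_common(1)[0]
        schedule[key] = destination                    # join this key where most tuples live
        network_cost += sum(counts.values()) - local   # tuples that must be shipped
    return schedule, network_cost

r = {"k1": Counter({"node0": 90, "node1": 5}), "k2": Counter({"node2": 10})}
s = {"k1": Counter({"node1": 3}), "k2": Counter({"node0": 1, "node2": 40})}
print(per_key_schedule(r, s))
# ({'k1': 'node0', 'k2': 'node2'}, 9)  -> only 9 tuples cross the network
```

A single-attribute hash partitioning of the same data could force nearly all tuples to move; exploiting per-key locality is what keeps the cost low here.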
A Relational Framework for Classifier Engineering
IF 1.8, Zone 2, Computer Science
ACM Transactions on Database Systems Pub Date : 2018-11-26 DOI: 10.1145/3268931
B. Kimelfeld, C. Ré
{"title":"A Relational Framework for Classifier Engineering","authors":"B. Kimelfeld, C. Ré","doi":"10.1145/3268931","DOIUrl":"https://doi.org/10.1145/3268931","url":null,"abstract":"In the design of analytical procedures and machine learning solutions, a critical and time-consuming task is that of feature engineering, for which various recipes and tooling approaches have been developed. In this article, we embark on the establishment of database foundations for feature engineering. We propose a formal framework for classification in the context of a relational database. The goal of this framework is to open the way to research and techniques to assist developers with the task of feature engineering by utilizing the database’s modeling and understanding of data and queries and by deploying the well-studied principles of database management. As a first step, we demonstrate the usefulness of this framework by formally defining three key algorithmic challenges. The first challenge is that of separability, which is the problem of determining the existence of feature queries that agree with the training examples. The second is that of evaluating the VC dimension of the model class with respect to a given sequence of feature queries. The third challenge is identifiability, which is the task of testing for a property of independence among features that are represented as database queries. We give preliminary results on these challenges for the case where features are defined by means of conjunctive queries, and, in particular, we study the implication of various traditional syntactic restrictions on the inherent computational complexity.","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":null,"pages":null},"PeriodicalIF":1.8,"publicationDate":"2018-11-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89769531","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
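The separability challenge can be illustrated in a simplified form. Assumptions in this sketch: features are plain Python predicates rather than database queries, the feature set is fixed in advance (whereas the paper asks whether suitable feature queries exist within a query class), and the training data is invented. For a fixed feature set, the labeled examples admit a consistent classifier over those features exactly when no positive and negative example induce the same feature vector.

```python
# Simplified separability check for a fixed set of feature predicates.
def feature_vectors(examples, feature_queries):
    return [tuple(q(e) for q in feature_queries) for e in examples]

def separable(examples, labels, feature_queries):
    vectors = feature_vectors(examples, feature_queries)
    positives = {v for v, y in zip(vectors, labels) if y}
    negatives = {v for v, y in zip(vectors, labels) if not y}
    return positives.isdisjoint(negatives)       # no colliding feature vectors

# Hypothetical training data: each example stands in for the relevant part
# of a database instance.
examples = [{"age": 25, "vip": True}, {"age": 40, "vip": False},
            {"age": 25, "vip": False}]
labels = [True, True, False]
features = [lambda e: e["age"] >= 30, lambda e: e["vip"]]
print(separable(examples, labels, features))     # True
```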
The five color concurrency control protocol: non-two-phase locking in general databases
IF 1.8, Zone 2, Computer Science
ACM Transactions on Database Systems Pub Date : 2018-03-02 DOI: 10.1145/78922.78927
P. Dasgupta, Z. Kedem
{"title":"The five color concurrency control protocol: non-two-phase locking in general databases","authors":"P. Dasgupta, Z. Kedem","doi":"10.1145/78922.78927","DOIUrl":"https://doi.org/10.1145/78922.78927","url":null,"abstract":"Concurrency control protocols based on two-phase locking are a popular family of locking protocols that preserve serializability in general (unstructured) database systems. A concurrency control algorithm (for databases with no inherent structure) is presented that is practical, non two-phase, and allows varieties of serializable logs not possible with any commonly known locking schemes. All transactions are required to predeclare the data they intend to read or write. Using this information, the protocol anticipates the existence (or absence) of possible conflicts and hence can allow non-two-phase locking.\u0000It is well known that serializability is characterized by acyclicity of the conflict graph representation of interleaved executions. The two-phase locking protocols allow only forward growth of the paths in the graph. The Five Color protocol allows the conflict graph to grow in any direction (avoiding two-phase constraints) and prevents cycles in the graph by maintaining transaction access information in the form of data-item markers. The read and write set information can also be used to provide relative immunity from deadlocks.","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":null,"pages":null},"PeriodicalIF":1.8,"publicationDate":"2018-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76949220","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 15
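The abstract appeals to the classical characterization that a schedule is conflict-serializable iff its conflict graph is acyclic. The following minimal sketch builds that graph from a schedule of read/write operations and tests it for cycles; it illustrates the characterization only, not the Five Color protocol itself, and the schedule format is invented.

```python
# Build the conflict graph of a schedule (an edge T_i -> T_j for each pair of
# conflicting operations where T_i's operation comes first) and test acyclicity.
from collections import defaultdict

def conflict_serializable(schedule):
    """schedule: list of (txn, op, item), op in {'r', 'w'}, in execution order."""
    edges = defaultdict(set)
    for i, (ti, oi, xi) in enumerate(schedule):
        for tj, oj, xj in schedule[i + 1:]:
            if ti != tj and xi == xj and 'w' in (oi, oj):
                edges[ti].add(tj)                 # ti's conflicting op precedes tj's

    WHITE, GRAY, BLACK = 0, 1, 2                  # DFS colors for cycle detection
    color = defaultdict(int)

    def has_cycle(node):
        color[node] = GRAY
        for nxt in edges[node]:
            if color[nxt] == GRAY or (color[nxt] == WHITE and has_cycle(nxt)):
                return True
        color[node] = BLACK
        return False

    txns = {t for t, _, _ in schedule}
    return not any(color[t] == WHITE and has_cycle(t) for t in txns)

# No shared items, hence an empty (acyclic) conflict graph:
ok = [("T1", "r", "x"), ("T2", "r", "y"), ("T1", "w", "x"), ("T2", "w", "y")]
# T1 -> T2 on item x and T2 -> T1 on item y form a cycle:
bad = [("T1", "r", "x"), ("T2", "w", "x"), ("T2", "w", "y"), ("T1", "r", "y")]
print(conflict_serializable(ok))    # True
print(conflict_serializable(bad))   # False
```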