ACM Transactions on Database Systems: Latest Articles, Page 4

Efficiently Cleaning Structured Event Logs: A Graph Repair Approach

IF 1.8, Zone 2, Computer Science

ACM Transactions on Database Systems Pub Date : 2022-11-17 DOI: 10.1145/3571281

Ruihong Huang, Jianmin Wang, Shaoxu Song, Xuemin Lin, Xiaochen Zhu, Jian Pei

Abstract: Event data are often dirty owing to various recording conventions or simply system errors. These errors may cause serious damage to real applications, such as inaccurate provenance answers, poor profiling results, or concealing interesting patterns from event data. Cleaning dirty event data is strongly demanded. While existing event data cleaning techniques view event logs as sequences, structural information does exist among events, such as the task passing relationships between staffs in workflow or the invocation relationships among different micro-services in monitoring application performance. We argue that such structural information enhances not only the accuracy of repairing inconsistent events but also the computation efficiency. It is notable that both the structure and the names (labeling) of events could be inconsistent. In real applications, while an unsound structure is not repaired automatically (which requires manual effort from business actors to handle the structure error), it is highly desirable to repair the inconsistent event names introduced by recording mistakes. In this article, we first prove that the inconsistent label repairing problem is NP-complete. Then, we propose a graph repair approach for (1) detecting unsound structures, and (2) repairing inconsistent event names. Efficient pruning techniques together with two heuristic solutions are also presented. Extensive experiments over real and synthetic datasets demonstrate both the effectiveness and efficiency of our proposal.

Volume: 31 1 Pages: 1 - 44

Citations: 0

Conjunctive Queries: Unique Characterizations and Exact Learnability

IF 1.8, Zone 2, Computer Science

ACM Transactions on Database Systems Pub Date : 2022-11-06 DOI: 10.1145/3559756

Balder Ten Cate, Victor Dalmau

Citations: 0

Deciding Robustness for Lower SQL Isolation Levels

IF 1.8, Zone 2, Computer Science

ACM Transactions on Database Systems Pub Date : 2022-11-06 DOI: 10.1145/3561049

Bas Ketsman, Christoph Koch, Frank Neven, Brecht Vandevoort

Abstract: While serializability always guarantees application correctness, lower isolation levels can be chosen to improve transaction throughput at the risk of introducing certain anomalies. A set of transactions is robust against a given isolation level if every possible interleaving of the transactions under the specified isolation level is serializable. Robustness therefore always guarantees application correctness with the performance benefit of the lower isolation level. While the robustness problem has received considerable attention in the literature, only sufficient conditions have been obtained. The most notable exception is the seminal work by Fekete where he obtained a characterization for deciding robustness against SNAPSHOT ISOLATION. In this article, we address the robustness problem for the lower SQL isolation levels READ UNCOMMITTED and READ COMMITTED, which are defined in terms of the forbidden dirty write and dirty read patterns. The first main contribution of this article is that we characterize robustness against both isolation levels in terms of the absence of counter-example schedules of a specific form (split and multi-split schedules) and by the absence of cycles in interference graphs that satisfy various properties. A critical difference with Fekete’s work is that the properties of cycles obtained in this article have to take the relative ordering of operations within transactions into account, as READ UNCOMMITTED and READ COMMITTED do not satisfy the atomic visibility requirement. A particular consequence is that the latter renders the robustness problem against READ COMMITTED coNP-complete. The second main contribution of this article is the coNP-hardness proof. For READ UNCOMMITTED, we obtain LOGSPACE-completeness.

Volume: 20 1
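
The abstract defines robustness in terms of every interleaving being serializable. As a rough illustration of the building block only, the sketch below tests whether one given schedule is conflict serializable by looking for a cycle in its conflict graph. This is the textbook check for a single schedule, not the article's decision procedure for robustness; the schedule encoding `(transaction, operation, object)` and all function names are assumptions made for this example.

```python
from collections import defaultdict


def conflict_graph(schedule):
    """Add an edge Ti -> Tj when an operation of Ti precedes a conflicting one of Tj."""
    edges = defaultdict(set)
    for i, (t1, op1, obj1) in enumerate(schedule):
        for t2, op2, obj2 in schedule[i + 1:]:
            # Two operations conflict if they come from different transactions,
            # touch the same object, and at least one of them is a write.
            if t1 != t2 and obj1 == obj2 and "W" in (op1, op2):
                edges[t1].add(t2)
    return edges


def has_cycle(edges):
    """Depth-first search with three colours to detect a directed cycle."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = defaultdict(int)

    def visit(node):
        color[node] = GRAY
        for nxt in edges.get(node, ()):
            if color[nxt] == GRAY or (color[nxt] == WHITE and visit(nxt)):
                return True
        color[node] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in list(edges))


def is_conflict_serializable(schedule):
    return not has_cycle(conflict_graph(schedule))


# Lost-update interleaving: both transactions read x before either writes it.
lost_update = [("T1", "R", "x"), ("T2", "R", "x"), ("T1", "W", "x"), ("T2", "W", "x")]
# The same operations executed one transaction after the other.
serial = [("T1", "R", "x"), ("T1", "W", "x"), ("T2", "R", "x"), ("T2", "W", "x")]
```

Here `lost_update` yields the edges T1 -> T2 and T2 -> T1 and is rejected, while `serial` yields only T1 -> T2 and is accepted. Deciding robustness would require reasoning about all interleavings allowed by the isolation level, which this sketch does not attempt.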

Citations: 0

Robust and Efficient Sorting with Offset-value Coding

IF 1.8, Zone 2, Computer Science

ACM Transactions on Database Systems Pub Date : 2022-09-17 DOI: 10.1145/3570956

Thanh Do, G. Graefe

Citations: 1

Proximity Queries on Terrain Surface

IF 1.8, Zone 2, Computer Science

ACM Transactions on Database Systems Pub Date : 2022-09-16 DOI: 10.1145/3563773

Victor Junqiu Wei, R. C. Wong, Cheng Long, D. Mount, H. Samet

Abstract: Due to the advance of the geo-spatial positioning and the computer graphics technology, digital terrain data has become increasingly popular nowadays. Query processing on terrain data has attracted considerable attention from both the academic and the industry communities. Proximity queries such as the shortest path/distance query, k nearest/farthest neighbor query, and top-k closest/farthest pairs query are fundamental and important queries in the context of the terrain surfaces, and they have a lot of applications in Geographical Information System, 3D object feature vector construction, and 3D object data mining. In this article, we first study the most fundamental type of query, namely, shortest distance and path query, which is to find the shortest distance and path between two points of interest on the surface of the terrain. As observed by existing studies, computing the exact shortest distance/path is very expensive. Some existing studies proposed ϵ-approximate distance and path oracles, where ϵ is a non-negative real-valued error parameter. However, the best-known algorithm has a large oracle construction time, a large oracle size, and a large query time. Motivated by this, we propose a novel ϵ-approximate distance and path oracle called the Space Efficient distance and path oracle (SE), which has a small oracle construction time, a small oracle size, and a small distance and path query time, thanks to its compactness of storing concise information about pairwise distances between any two points-of-interest. Then, we propose several algorithms for the k nearest/farthest neighbor and top-k closest/farthest pairs queries with the assistance of our distance and path oracle SE. Our experimental results show that the oracle construction time, the oracle size, and the distance and path query time of SE are up to two, three, and five orders of magnitude faster than the best-known algorithm, respectively. Besides, our algorithms for other proximity queries including k nearest/farthest neighbor queries and top-k closest/farthest pairs queries significantly outperform the state-of-the-art algorithms by up to two orders of magnitude.

Volume: 47 1 Pages: 1 - 59

Citations: 2

Deciding Robustness for Lower SQL Isolation Levels

IF 1.8, Zone 2, Computer Science

ACM Transactions on Database Systems Pub Date : 2022-09-03 DOI: 10.1145/3561049

Bas Ketsman, Christoph E. Koch, F. Neven, Brecht Vandevoort

Volume: 47 1 Pages: 1 - 41

Citations: 7

Persistent Summaries

IF 1.8, Zone 2, Computer Science

ACM Transactions on Database Systems Pub Date : 2022-08-18 DOI: 10.1145/3531053

Tianjing Zeng, Zhewei Wei, Ge Luo, Ke Yi, Xiaoyong Du, Ji-Rong Wen

Abstract: A persistent data structure, also known as a multiversion data structure in the database literature, is a data structure that preserves all its previous versions as it is updated over time. Every update (inserting, deleting, or changing a data record) to the data structure creates a new version, while all the versions are kept in the data structure so that any previous version can still be queried.

Persistent data structures aim at recording all versions accurately, which results in a space requirement that is at least linear to the number of updates. In many of today’s big data applications, in particular, for high-speed streaming data, the volume and velocity of the data are so high that we cannot afford to store everything. Therefore, streaming algorithms have received a lot of attention in the research community, which uses only sublinear space by sacrificing slightly on accuracy.

All streaming algorithms work by maintaining a small data structure in memory, which is usually called a sketch, summary, or synopsis. The summary is updated upon the arrival of every element in the stream, thus it is ephemeral, meaning that it can only answer queries about the current status of the stream. In this article, we aim at designing persistent summaries, thereby giving streaming algorithms the ability to answer queries about the stream at any prior time.

Volume: 1 4
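
To make the distinction between an ephemeral and a persistent structure concrete, here is a minimal sketch of an exact persistent counter that can report the count as of any earlier timestamp. It stores one record per update, so it illustrates the linear-space baseline the abstract describes rather than the article's sublinear persistent summaries. The class name `PersistentCounter` and its methods are hypothetical.

```python
import bisect


class PersistentCounter:
    """Exact counter that remembers every version (space linear in the number of updates)."""

    def __init__(self):
        self._times = []   # update timestamps, kept in non-decreasing order
        self._counts = []  # running total after each update

    def add(self, timestamp, delta=1):
        if self._times and timestamp < self._times[-1]:
            raise ValueError("timestamps must be non-decreasing")
        previous = self._counts[-1] if self._counts else 0
        self._times.append(timestamp)
        self._counts.append(previous + delta)

    def count_at(self, timestamp):
        """Return the total over all updates with time <= timestamp."""
        idx = bisect.bisect_right(self._times, timestamp)
        return self._counts[idx - 1] if idx else 0


counter = PersistentCounter()
for t in (1, 3, 3, 7):
    counter.add(t)
```

An ephemeral counter would keep only the latest total; the binary search over stored versions is what lets `count_at` answer queries about prior times, at the cost of keeping every update.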

Citations: 0

On Finding Rank Regret Representatives

IF 1.8, Zone 2, Computer Science

ACM Transactions on Database Systems Pub Date : 2022-08-18 DOI: 10.1145/3531054

Abolfazl Asudeh, Gautam Das, H. V. Jagadish, Shangqi Lu, Azade Nazi, Yufei Tao, Nan Zhang, Jianwen Zhao

Abstract: Selecting the best items in a dataset is a common task in data exploration. However, the concept of “best” lies in the eyes of the beholder: Different users may consider different attributes more important and, hence, arrive at different rankings. Nevertheless, one can remove “dominated” items and create a “representative” subset of the data, comprising the “best items” in it. A Pareto-optimal representative is guaranteed to contain the best item of each possible ranking, but it can be a large portion of data. A much smaller representative can be found if we relax the requirement of including the best item for each user and instead just limit the users’ “regret.” Existing work defines regret as the loss in score by limiting consideration to the representative instead of the full dataset, for any chosen ranking function.

However, the score is often not a meaningful number, and users may not understand its absolute value. Sometimes small ranges in score can include large fractions of the dataset. In contrast, users do understand the notion of rank ordering. Therefore, we consider items’ positions in the ranked list in defining the regret and propose the rank-regret representative as the minimal subset of the data containing at least one of the top-k of any possible ranking function. This problem is polynomial time solvable in two-dimensional space but is NP-hard on three or more dimensions. We design a suite of algorithms to fulfill different purposes, such as whether relaxation is permitted on k, the result size, or both, whether a distribution is known, whether theoretical guarantees or practical efficiency is important, and so on. Experiments on real datasets demonstrate that we can efficiently find small subsets with small rank-regrets.

Volume: 9 1-2
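
The notion of rank-regret can be illustrated with a small sketch that measures, for a candidate subset, the best rank any of its items achieves under each ranking function and then takes the worst case. The article considers any possible ranking function; this example assumes linear scoring and evaluates only a finite, hand-picked list of weight vectors, so it gives a lower bound on the true rank-regret. The function name `rank_regret` and the sample data are hypothetical.

```python
def rank_regret(items, subset, weight_vectors):
    """Worst-case (over the given weights) best rank of any subset item, 1-based."""
    worst = 0
    for w in weight_vectors:
        def score(item, w=w):
            return sum(wi * xi for wi, xi in zip(w, item))

        # Rank all items by descending score; ties are broken by original index.
        ranked = sorted(range(len(items)), key=lambda i: score(items[i]), reverse=True)
        best_rank = min(pos + 1 for pos, i in enumerate(ranked) if i in subset)
        worst = max(worst, best_rank)
    return worst


items = [(1.0, 0.0), (0.0, 1.0), (0.6, 0.6), (0.2, 0.2)]
weights = [(1, 0), (0, 1), (1, 1)]
```

With these values, the single item at index 2 is within the top 2 for every sampled weight vector, so `{2}` is a rank-regret representative for k = 2 with respect to this sample, whereas `{0}` alone falls to rank 4 under the weights `(0, 1)`.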

Citations: 0

Influence Maximization Revisited: Efficient Sampling with Bound Tightened

IF 1.8, Zone 2, Computer Science

ACM Transactions on Database Systems Pub Date : 2022-08-18 DOI: 10.1145/3533817

Qintian Guo, Sibo Wang, Zhewei Wei, Wenqing Lin, Jing Tang

Abstract: Given a social network G with n nodes and m edges, a positive integer k, and a cascade model C, the influence maximization (IM) problem asks for k nodes in G such that the expected number of nodes influenced by the k nodes under cascade model C is maximized. The state-of-the-art approximate solutions run in $O(k(n+m)\log n/\varepsilon^2)$ expected time while returning a $(1 - 1/e - \varepsilon)$ approximate solution with at least $1 - 1/n$ probability. A key phase of these IM algorithms is the random reverse reachable (RR) set generation, and this phase significantly affects the efficiency and scalability of the state-of-the-art IM algorithms.

In this article, we present a study on this key phase and propose an efficient random RR set generation algorithm under IC model. With the new algorithm, we show that the expected running time of existing IM algorithms under IC model can be improved to $O(k \cdot n \log n/\varepsilon^2)$, when for any node v, the total weight of its incoming edges is no larger than a constant. For the general IC model where the weights are skewed, we present a sampling algorithm SKIP. To the best of our knowledge, it is the first index-free algorithm that achieves the optimal time complexity of the sorted subset sampling problem.

Moreover, existing approximate IM algorithms suffer from scalability issues in high influence networks where the size of random RR sets is usually quite large. We tackle this challenging issue by reducing the average size of random RR sets without sacrificing the approximation guarantee. The proposed solution is orders of magnitude faster than the state of the art as shown in our experiment.

Besides, we investigate the issues of forward propagation and derive its time complexity with our proposed subset sampling techniques. We also present a heuristic condition to indicate when the forward propagation approach should be utilized to estimate the expected influence of a given seed set.

Volume: 6 4
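
For readers unfamiliar with the RR set phase mentioned in the abstract, the following is a minimal sketch of the conventional way to draw one random reverse reachable set under the independent cascade (IC) model: pick a root and run a reverse breadth-first search, keeping each incoming edge independently with its probability. It flips one coin per incoming edge, which is the cost the article sets out to reduce; the faster sampling and the SKIP algorithm are not reproduced here. The adjacency format `in_edges` and the function name are assumptions for this example.

```python
import random
from collections import deque


def random_rr_set(in_edges, nodes, rng):
    """Draw one random RR set.

    in_edges[v] is a list of (u, p) pairs: edge u -> v is live with probability p.
    nodes is a sequence of candidate roots; rng is a random.Random instance.
    """
    root = rng.choice(nodes)
    rr = {root}
    queue = deque([root])
    while queue:
        v = queue.popleft()
        for u, p in in_edges.get(v, ()):
            # One coin flip per incoming edge of every visited node.
            if u not in rr and rng.random() < p:
                rr.add(u)
                queue.append(u)
    return rr


# Path a -> b -> c in which every edge is certainly live.
certain = {"b": [("a", 1.0)], "c": [("b", 1.0)]}
# The same path in which no edge is ever live.
never = {"b": [("a", 0.0)], "c": [("b", 0.0)]}
```

With `certain` and root `c`, the RR set is every node that can reach `c`; with `never`, it is the root alone. Intermediate probabilities give random sets between these extremes.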

Citations: 0

Answering (Unions of) Conjunctive Queries using Random Access and Random-Order Enumeration

IF 1.8, Zone 2, Computer Science

ACM Transactions on Database Systems Pub Date : 2022-08-18 DOI: 10.1145/3531055

Nofar Carmeli, Shai Zeevi, Christoph Berkholz, Alessio Conte, Benny Kimelfeld, Nicole Schweikardt

Abstract: As data analytics becomes more crucial to digital systems, so grows the importance of characterizing the database queries that admit a more efficient evaluation. We consider the tractability yardstick of answer enumeration with a polylogarithmic delay after a linear-time preprocessing phase. Such an evaluation is obtained by constructing, in the preprocessing phase, a data structure that supports polylogarithmic-delay enumeration. In this article, we seek a structure that supports the more demanding task of a “random permutation”: polylogarithmic-delay enumeration in truly random order. Enumeration of this kind is required if downstream applications assume that the intermediate results are representative of the whole result set in a statistically meaningful manner. An even more demanding task is that of “random access”: polylogarithmic-time retrieval of an answer whose position is given.

We establish that the free-connex acyclic CQs are tractable in all three senses: enumeration, random-order enumeration, and random access; and in the absence of self-joins, it follows from past results that every other CQ is intractable by each of the three (under some fine-grained complexity assumptions). However, the three yardsticks are separated in the case of a union of CQs (UCQ): while a union of free-connex acyclic CQs has a tractable enumeration, it may (provably) admit no random access. We identify a fragment of such UCQs where we can guarantee random access with polylogarithmic access time (and linear-time preprocessing) and a more general fragment where we can guarantee tractable random permutation. For general unions of free-connex acyclic CQs, we devise two algorithms with relaxed guarantees: one has logarithmic delay in expectation, and the other provides a permutation that is almost uniformly distributed. Finally, we present an implementation and an empirical study that show a considerable practical superiority of our random-order enumeration approach over state-of-the-art alternatives.

Volume: 84 1
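
The abstract ranks random access as more demanding than random-order enumeration. One generic way to see why random access suffices is a lazy Fisher-Yates shuffle: given the number of answers and a function that returns the answer at a position, it emits all answers in uniformly random order while storing only the positions it has displaced. This is a standard technique offered as a sketch, not necessarily the construction used in the article; the `access` callback stands in for the article's polylogarithmic-time random-access structure and is an assumption here.

```python
import random


def random_order(n, access, rng):
    """Yield access(i) for every i in range(n), in uniformly random order."""
    displaced = {}  # sparse view of the array being shuffled: position -> value
    for i in range(n):
        j = rng.randrange(i, n)          # pick a not-yet-emitted position
        value_i = displaced.get(i, i)
        value_j = displaced.get(j, j)
        displaced[j] = value_i           # move position i's value into slot j
        displaced.pop(i, None)           # slot i is finished and never read again
        yield access(value_j)


answers = ["a1", "a2", "a3", "a4", "a5"]
shuffled = list(random_order(len(answers), answers.__getitem__, random.Random(42)))
```

Each step costs one call to `access` plus constant-time dictionary work, so the delay between answers is dominated by the random-access time, and the enumeration can be stopped early with the emitted prefix still being a uniform sample without replacement.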

Citations: 0