{"title":"Link Local Differential Privacy in GNNs via Bayesian Estimation","authors":"Xiaochen Zhu","doi":"10.1145/3555041.3589398","DOIUrl":"https://doi.org/10.1145/3555041.3589398","url":null,"abstract":"Recent years have witnessed the emergence of graph neural networks (GNNs) and increasing attention to GNNs from the data management community. Yet, training GNNs may raise privacy concerns, as the trained model may reveal sensitive information that must be kept private by law. In this paper, we study GNNs with link local differential privacy over decentralized nodes, where an untrusted server collaborates with node clients to train a GNN model without revealing the existence of any link. We find that by spending the privacy budget independently on the links and degrees of the graph, the server can use Bayesian estimation to better denoise the graph topology. Unlike existing approaches, our mechanism does not aim to preserve graph density; instead, it lets the server estimate fewer links under a lower privacy budget and higher uncertainty. Hence, the server makes fewer false-positive link estimations and trains better models. Finally, we conduct extensive experiments to demonstrate that our method achieves considerably higher accuracy than existing approaches under the same privacy budget.","PeriodicalId":161812,"journal":{"name":"Companion of the 2023 International Conference on Management of Data","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134084041","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Disaggregated Database Systems","authors":"Jianguo Wang, Qizhen Zhang","doi":"10.1145/3555041.3589403","DOIUrl":"https://doi.org/10.1145/3555041.3589403","url":null,"abstract":"Disaggregated database systems achieve unprecedented elasticity and resource utilization at cloud scale and have recently gained great momentum in both industry and academia. Such systems are developed in response to the emerging trend of disaggregated data centers, where resources are physically separated and connected through fast data center networks. Database management systems have traditionally been built on monolithic architectures, so disaggregation fundamentally challenges their designs. On the other hand, disaggregation offers benefits such as independent scaling of compute, memory, and storage. Nonetheless, there is a lack of systematic investigation into the new research challenges and opportunities in recent disaggregated database systems. To provide database researchers and practitioners with insights into different forms of resource disaggregation, we take a snapshot of state-of-the-art disaggregated database systems and related techniques and present an in-depth tutorial. The primary goal is to better understand the enabling techniques and characteristics of resource disaggregation and its implications for next-generation database systems. To that end, we survey recent work on storage disaggregation, which separates secondary storage devices (e.g., SSDs) from compute servers and is widely deployed in current cloud data centers, and memory disaggregation, which further splits compute and memory with Remote Direct Memory Access (RDMA) and is driving the transformation of clouds. In addition, we discuss two techniques that bring novel perspectives to these two paradigms: persistent memory and Compute Express Link (CXL). Finally, we identify several directions that shed light on the future development of disaggregated database systems.","PeriodicalId":161812,"journal":{"name":"Companion of the 2023 International Conference on Management of Data","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121471158","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"International Workshop on Data Management on New Hardware (DaMoN)","authors":"Norman May, Nesime Tatbul","doi":"10.1145/3555041.3590816","DOIUrl":"https://doi.org/10.1145/3555041.3590816","url":null,"abstract":"New hardware, such as multi-core CPUs, GPUs, FPGAs, new memory and storage technologies, and low-power devices, brings new challenges and opportunities for optimizing database system performance. Consequently, exploiting the characteristics of modern hardware has become an important topic of database systems research. Over the last two decades, the DaMoN Workshop has established itself as the primary database venue for presenting ideas on how to exploit new hardware for data management: in particular, how to improve the performance or scalability of databases, how new hardware unlocks new database application scenarios, and how data management could benefit from future hardware.","PeriodicalId":161812,"journal":{"name":"Companion of the 2023 International Conference on Management of Data","volume":"78 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124281664","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Aggregation and Exploration of High-Dimensional Data Using the Sudokube Data Cube Engine","authors":"Sachin Basil John, P. Lindner, Zhekai Jiang, Christoph E. Koch","doi":"10.1145/3555041.3589729","DOIUrl":"https://doi.org/10.1145/3555041.3589729","url":null,"abstract":"We present Sudokube, a novel system that supports interactive speed querying on high-dimensional data using partially materialized data cubes. Given a storage budget, it judiciously chooses what projections to precompute and materialize during cube construction time. Then, at query time, it uses whatever information is available from the materialized projections and extrapolates missing information to approximate query results. Thus, Sudokube avoids costly projections at query time while also avoiding the astronomical compute and storage requirements needed for fully materialized high-dimensional data cubes. In this paper, we show the capabilities of the Sudokube system and how it approximates query results using different techniques and materialization strategies.","PeriodicalId":161812,"journal":{"name":"Companion of the 2023 International Conference on Management of Data","volume":"84 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126216527","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fast Natural Language Based Data Exploration with Samples","authors":"Shubham Agarwal, G. Chan, Shaddy Garg, Tong Yu, Subrata Mitra","doi":"10.1145/3555041.3589724","DOIUrl":"https://doi.org/10.1145/3555041.3589724","url":null,"abstract":"Extracting insights from large amounts of data in a timely manner is a crucial problem. Exploratory Data Analysis (EDA) is commonly used by analysts to uncover insights through a sequence of SQL commands and associated visualizations. However, in many cases this process is carried out by non-programmers working under tight time constraints, such as a marketer who must quickly analyze large amounts of campaign data to reach a target revenue. This paper presents ApproxEDA, a system that combines a natural language processing (NLP) interface for insight discovery with an underlying sample-based EDA engine. The NLP interface converts high-level questions into contextual SQL queries over the dataset, while the backend EDA engine significantly speeds up insight discovery by selecting the optimal sample from among many pre-created samples built with various sampling strategies. We demonstrate that ApproxEDA addresses two key aspects: converting high-level NLP inputs to contextual SQL, and intelligently selecting samples using a reinforcement learning agent. This protects users from diverging from their original intent of analysis, which can occur due to approximation errors in results and visualizations, while still providing optimal latency reduction through the use of samples.","PeriodicalId":161812,"journal":{"name":"Companion of the 2023 International Conference on Management of Data","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121651707","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"49 Years of Queries","authors":"D. Chamberlin","doi":"10.1145/3555041.3589336","DOIUrl":"https://doi.org/10.1145/3555041.3589336","url":null,"abstract":"The relational data model, proposed by Ted Codd in the 1970s, has been the dominant paradigm for storing and accessing business data for several decades. In this talk, I'll share some stories from the early days of relational databases, and examine some reasons for the remarkable resilience of relational database technology. I'll discuss some of the challenges to the relational approach that have arisen over the years. I'll also discuss the evolution of SQL, and offer some thoughts about how the language may continue to evolve in the future.","PeriodicalId":161812,"journal":{"name":"Companion of the 2023 International Conference on Management of Data","volume":"97 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122290982","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"BCNF* - From Normalized- to Star-Schemas and Back Again","authors":"Marie Fischer, Paul Roessler, Paul Sieben, Janina Adamcic, Christoph Kirchherr, Tobias Straeubig, Youri Kaminsky, Felix Naumann","doi":"10.1145/3555041.3589712","DOIUrl":"https://doi.org/10.1145/3555041.3589712","url":null,"abstract":"Data warehouses are the core of many data analysis processes. They contain various database schemas, which are designed and created through schema transformation and integration. These processes are complex and require technical knowledge, which makes them costly and prevents business teams from starting new analyses independently. BCNF* is a web application that enables users to safely explore valid schema transformations and to generate transformation scripts automatically. It can be used for any schema transformation, but is optimized for semi-automatic data warehouse creation through features such as a dedicated star-schema mode.","PeriodicalId":161812,"journal":{"name":"Companion of the 2023 International Conference on Management of Data","volume":"71 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124075504","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fast Joint Shapley Values","authors":"Mihail Stoian","doi":"10.1145/3555041.3589393","DOIUrl":"https://doi.org/10.1145/3555041.3589393","url":null,"abstract":"The Shapley value has recently drawn the attention of the data management community. Briefly, the Shapley value is a well-known numerical measure for the contribution of a player to a coalitional game. In the direct extension of Shapley axioms, the newly introduced joint Shapley value provides a measure for the average contribution of a set of players. However, due to its exponential nature, it is computationally intensive: for an explanation order of k, the original algorithm takes O(min(3^n, 2^n n^k)) time. In this work, we improve it to O(2^n nk).","PeriodicalId":161812,"journal":{"name":"Companion of the 2023 International Conference on Management of Data","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133567447","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Faster FFT-based Wildcard Pattern Matching","authors":"Mihail Stoian","doi":"10.1145/3555041.3589391","DOIUrl":"https://doi.org/10.1145/3555041.3589391","url":null,"abstract":"We study the problem of pattern matching with wildcards, which naturally occurs in the SQL expression LIKE. It consists of finding the occurrences of a pattern P, |P| = m, in a text T, |T| = n, where the pattern may contain wildcards, i.e., special characters that can match any letter of the alphabet. The naive algorithm for this problem takes O(nm) time, since at each position of T we need O(m) to check whether a match is possible. Several faster algorithms have been proposed, the simplest being a deterministic FFT-based algorithm in which pattern matching is interpreted in algebraic form, i.e., P matches T iff (P-T)^2 = 0. This naturally leads to an O(n log n) algorithm via FFT, as we can evaluate this expression and search for zero-valued coefficients. Clifford et al. introduced a trick to achieve O(n log m): instead of matching the entire text to the pattern, the text is divided into n / m overlapping slices of length 2m, each of which is matched to the pattern in O(m log m). The total time complexity is then O((n / m) m log m) = O(n log m). Other works, especially in pattern matching with errors, rely on this trick. However, the O-expression hides a factor of 4 in this case, assuming m = 2^k: FFT-based matching between strings of length m and 2m actually requires 4m log 4m steps, since the result has size 3m - 1 and the FFT requires a power of two as its size. We argue that this trick incurs redundancy and show how the redundancy can be eliminated to achieve a twice-as-fast O(n log m) algorithm without compromise. Furthermore, we show experimentally that the proposed algorithm approaches the theoretical improvement.","PeriodicalId":161812,"journal":{"name":"Companion of the 2023 International Conference on Management of Data","volume":"65 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132528597","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Auto-WLM: Machine Learning Enhanced Workload Management in Amazon Redshift","authors":"Gaurav Saxena, Mohammad Rahman, Naresh Chainani, Chunbin Lin, George C. Caragea, Fahim Chowdhury, Ryan Marcus, Tim Kraska, I. Pandis, Balakrishnan Narayanaswamy","doi":"10.1145/3555041.3589677","DOIUrl":"https://doi.org/10.1145/3555041.3589677","url":null,"abstract":"There has been a lot of excitement around using machine learning to improve the performance and usability of database systems. However, few of these techniques have actually been used in the critical path of customer-facing database services. In this paper, we describe Auto-WLM, a machine-learning-based automatic workload manager currently used in production in Amazon Redshift. Auto-WLM is an example of how machine learning can improve the performance of large data warehouses in practice and at scale. Auto-WLM intelligently schedules workloads to maximize throughput and horizontally scales clusters in response to workload spikes. While traditional heuristic-based workload management requires substantial manual tuning (e.g., of the concurrency level, the memory allocated to queries, etc.) for each specific workload, Auto-WLM performs this tuning automatically and as a result can quickly adapt and react to workload changes and demand spikes. At its core, Auto-WLM uses locally trained query performance models to predict the execution time and memory needs of each query, and uses these predictions to make intelligent scheduling decisions. Currently, Auto-WLM makes millions of decisions every day and constantly optimizes the performance of each individual Amazon Redshift cluster. In this paper, we describe the advantages and challenges of implementing and deploying Auto-WLM, and outline areas of research that may be of interest to those in the \"ML for systems\" community with an eye for practicality.","PeriodicalId":161812,"journal":{"name":"Companion of the 2023 International Conference on Management of Data","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116045221","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}