{"title":"QuTE: Answering Quantity Queries from Web Tables","authors":"Vinh Thinh Ho, K. Pal, G. Weikum","doi":"10.1145/3448016.3452763","DOIUrl":"https://doi.org/10.1145/3448016.3452763","url":null,"abstract":"Quantities are financial, technological, physical and other measures that denote relevant properties of entities, such as revenue of companies, energy efficiency of cars or distance and brightness of stars and galaxies. Queries with filter conditions on quantities are an important building block for downstream analytics, and pose challenges when the content of interest is spread across a huge number of web tables and other ad-hoc datasets. Search engines support quantity lookups, but largely fail on quantity filters. The QuTE system presented in this paper aims to overcome these problems. It comprises methods for automatically extracting entity-quantity facts from web tables, as well as methods for online query processing, with new techniques for query matching and answer ranking.","PeriodicalId":360379,"journal":{"name":"Proceedings of the 2021 International Conference on Management of Data","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115152048","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Citus","authors":"Umur Cubukcu, Ozgün Erdogan, Sumedh Pathak, Sudhakar Sannakkayala, M. Ślot","doi":"10.1145/3448016.3457551","DOIUrl":"https://doi.org/10.1145/3448016.3457551","url":null,"abstract":"Citus is an open source distributed database engine for PostgreSQL that is implemented as an extension. Citus gives users the ability to distribute data, queries, and transactions in PostgreSQL across a cluster of PostgreSQL servers to handle the needs of data-intensive applications. The development of Citus has largely been driven by conversations with companies looking to scale PostgreSQL beyond a single server and their workload requirements. This paper describes the requirements of four common workload patterns and how Citus addresses those requirements. It also shares benchmark results demonstrating the performance and scalability of Citus in each of the workload patterns and describes how Microsoft uses Citus to address one of its most challenging data problems.","PeriodicalId":360379,"journal":{"name":"Proceedings of the 2021 International Conference on Management of Data","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115228379","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Structure-Aware Machine Learning over Multi-Relational Databases","authors":"Maximilian Schleich","doi":"10.1145/3448016.3461670","DOIUrl":"https://doi.org/10.1145/3448016.3461670","url":null,"abstract":"We consider the problem of computing machine learning models over multi-relational databases. The mainstream approach involves a costly repeated loop that data scientists have to deal with on a daily basis: select features from data residing in relational databases using feature extraction queries involving joins, projections, and aggregations; export the training dataset defined by such queries; convert this dataset into the format of an external learning tool; and learn the desired model using this tool. In this thesis, we advocate for an alternative approach that avoids this loop and instead tightly integrates the query and learning tasks into one unified solution. By integrating these two tasks, we can exploit structure in the data and the query to optimize the end-to-end learning problem. We provide a framework for structure-aware learning for a variety of commonly used machine learning models that achieves runtime guarantees that can be asymptotically faster than the mainstream approach that first constructs the training dataset. In practice, this asymptotic gap translates into several orders of magnitude performance improvements over state-of-the-art machine learning packages such as TensorFlow, MADlib, scikit-learn, and mlpack. The thesis is composed of three parts. First, we present the methodology and theoretical foundation of structure-aware learning. Then, we report on the design and implementation of LMFAO, an in-memory engine for structure-aware learning over databases. Finally, we present an extensive experimental evaluation. In following, we briefly highlight each of these three parts.","PeriodicalId":360379,"journal":{"name":"Proceedings of the 2021 International Conference on Management of Data","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124247227","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MxTasks: How to Make Efficient Synchronization and Prefetching Easy","authors":"J. Mühlig, J. Teubner","doi":"10.1145/3448016.3457268","DOIUrl":"https://doi.org/10.1145/3448016.3457268","url":null,"abstract":"The hardware environment has changed rapidly in recent years: Many cores, multiple sockets, and large amounts of main memory have become a commodity. To benefit from these highly parallel systems, the software has to be adapted. Sophisticated latch-free data structures and algorithms are often meant to address the situation. But they are cumbersome to develop and may still not provide the desired scalability. As a remedy, we present MxTasking, a task-based framework that assists the design of latch-free and parallel data structures. MxTasking eases the information exchange between applications and the operating system, resulting in novel opportunities to manage resources in a truly hardware- and application-conscious way.","PeriodicalId":360379,"journal":{"name":"Proceedings of the 2021 International Conference on Management of Data","volume":"200 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124932881","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Framework for Differentially Private Data Analysis with Multiple Accuracy Requirements","authors":"K. Knopf","doi":"10.1145/3448016.3450587","DOIUrl":"https://doi.org/10.1145/3448016.3450587","url":null,"abstract":"Organizations who collect sensitive data, such as hospitals or governments, may want to share the data with others. There could be multiple applications or analysts that want to use this data. Directly releasing the data could violate the privacy of individual data contributors. To address this privacy concern, differential privacy [1,2] has arisen as a popular technique for allow for sensitive data analysis. It frequently works through the addition of randomized noise to the output of the analysis, which is controlled through the privacy parameter or budget ε. This noise affects the utility of the analyses, where a smaller budget allocation results in larger noise values, and some applications may set accuracy requirements on the output to restrict the amount of noise added [3,9,10]. The total privacy loss of a sequence of differentially private mechanisms can be composed by summing up the privacy budgets they use, under the property of sequential composition [2]. Hence, if we intend to run multiple applications or analyses on the same dataset, given a total privacy budget, we can support each application by splitting the privacy budget evenly among them. However, if there are many applications, the privacy budget received per application could be very small, resulting in poor overall utility.","PeriodicalId":360379,"journal":{"name":"Proceedings of the 2021 International Conference on Management of Data","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125836404","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficiently Supporting Adaptive Multi-Level Serializability Models in Distributed Database Systems","authors":"Zhanhao Zhao","doi":"10.1145/3448016.3450579","DOIUrl":"https://doi.org/10.1145/3448016.3450579","url":null,"abstract":"Informally, serializability means that transactions appear to have occurred in some total order. In this paper, we show that only the serializability guarantee with some total order is not enough for many real applications. As a complement, extra partial orders of transactions, like real-time order and program order, need to be introduced. Motivated by this observation, we present a framework that models serializable transactions by adding extra partial orders, namely multi-level serializability models. Following this framework, we propose a novel concurrency control algorithm, called bi-directionally timestamp adjustment (BDTA), to supporting multi-level serializability models in distributed database systems. We integrate the framework and BDTA into Greenplum and Deneva to show the benefits of our work. Our experiments show the performance gaps among serializability levels and confirm BDTA achieves up to 1.7× better than state-of-the-art concurrency control algorithms.","PeriodicalId":360379,"journal":{"name":"Proceedings of the 2021 International Conference on Management of Data","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116420615","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"P\u0000 2\u0000 B-Trace","authors":"Zhe Peng, Cheng Xu., Haixin Wang, Jinbin Huang, Jianliang Xu, Xiaowen Chu","doi":"10.1145/3448016.3459237","DOIUrl":"https://doi.org/10.1145/3448016.3459237","url":null,"abstract":"The eruption of a pandemic, such as COVID-19, can cause an unprecedented global crisis. Contact tracing, as a pillar of communicable disease control in public health for decades, has shown its effectiveness on pandemic control. Despite intensive research on contact tracing, existing schemes are vulnerable to attacks and can hardly simultaneously meet the requirements of data integrity and user privacy. The design of a privacy-preserving contact tracing framework to ensure the integrity of the tracing procedure has not been sufficiently studied and remains a challenge. In this paper, we propose P2B-Trace, a privacy-preserving contact tracing initiative based on blockchain. First, we design a decentralized architecture with blockchain to record an authenticated data structure of the user's contact records, which prevents the user from intentionally modifying his local records afterward. Second, we develop a zero-knowledge proximity verification scheme to further verify the user's proximity claim while protecting user privacy. We implement P2B-Trace and conduct experiments to evaluate the cost of privacy-preserving tracing integrity verification. The evaluation results demonstrate the effectiveness of our proposed system.","PeriodicalId":360379,"journal":{"name":"Proceedings of the 2021 International Conference on Management of Data","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122619095","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}