{"title":"Unified Spatial Analytics from Heterogeneous Sources with Amazon Redshift","authors":"Nemanja Borić, Hinnerk Gildhoff, M. Karavelas, I. Pandis, Ioanna Tsalouchidou","doi":"10.1145/3318464.3384704","DOIUrl":"https://doi.org/10.1145/3318464.3384704","url":null,"abstract":"Enterprise companies use spatial data to optimize decisions and to gain new insights regarding the locality of their business and services. Industries rely on efficiently combining spatial and business data from different sources, such as data warehouses, geospatial information systems, transactional systems, and data lakes, where spatial data can be found in structured or unstructured form. In this demonstration we present the spatial functionality of Amazon Redshift and its integration with other Amazon services, such as Amazon Aurora PostgreSQL and Amazon S3. We focus on the design and functionality of the feature, including the extensions in Redshift's state-of-the-art optimizer to push spatial processing close to where the data is stored.","PeriodicalId":436122,"journal":{"name":"Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data","volume":"112 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124741122","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Truss-based Community Search over Large Directed Graphs","authors":"Qing Liu, Minjun Zhao, Xin Huang, Jianliang Xu, Yunjun Gao","doi":"10.1145/3318464.3380587","DOIUrl":"https://doi.org/10.1145/3318464.3380587","url":null,"abstract":"Community search enables personalized community discovery and has wide applications in large real-world graphs. While community search has been extensively studied for undirected graphs, the problem for directed graphs has received attention only recently. However, existing studies suffer from several drawbacks, e.g., the vertices with varied in-degrees and out-degrees cannot be included in a community at the same time. To address the limitations, in this paper, we systematically study the problem of community search over large directed graphs. We start by presenting a novel community model, called D-truss, based on two distinct types of directed triangles, i.e., flow triangle and cycle triangle. The D-truss model brings nice structural and computational properties and has many advantages in comparison with the existing models. With this new model, we then formulate the D-truss community search problem, which is proved to be NP-hard. In view of its hardness, we propose two efficient 2-approximation algorithms, named Global and Local, that run in polynomial time yet with quality guarantee. To further improve the efficiency of the algorithms, we devise an indexing method based on D-truss decomposition. Consequently, the D-truss community search can be solved upon the D-truss index without time-consuming accesses to the original graph. Experimental studies on real-world graphs with ground-truth communities validate the quality of the solutions we obtain and the efficiency of the proposed algorithms.","PeriodicalId":436122,"journal":{"name":"Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data","volume":"100 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124703863","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Aggregation Support for Modern Graph Analytics in TigerGraph","authors":"Alin Deutsch, Yu Xu, Mingxi Wu, Victor E. Lee","doi":"10.1145/3318464.3386144","DOIUrl":"https://doi.org/10.1145/3318464.3386144","url":null,"abstract":"We describe how GSQL, TigerGraph's graph query language, supports the specification of aggregation in graph analytics. GSQL makes several unique design decisions with respect to both the expressive power and the evaluation complexity of the specified aggregation. We detail our design showing how our ideas transcend GSQL and are eminently portable to the upcoming graph query language standards as well as the existing pattern-based declarative query languages.","PeriodicalId":436122,"journal":{"name":"Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127130984","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PROUD: PaRallel OUtlier Detection for Streams","authors":"Theodoros Toliopoulos, Christos Bellas, A. Gounaris, A. Papadopoulos","doi":"10.1145/3318464.3384688","DOIUrl":"https://doi.org/10.1145/3318464.3384688","url":null,"abstract":"We introduce PROUD, standing for PaRallel OUtlier Detection for streams, which is an extensible engine for continuous multi-parameter parallel distance-based outlier (or anomaly) detection tailored to big data streams. PROUD is built on top of Flink. It defines a simple API for data ingestion. It supports a variety of parallel techniques, including novel ones, for continuous outlier detection that can be easily configured. In addition, it graphically reports metrics of interest and stores main results into a permanent store to enable future analysis. It can be easily extended to support additional techniques. Finally, it is publicly provided in open-source.","PeriodicalId":436122,"journal":{"name":"Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127138684","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Aggify: Lifting the Curse of Cursor Loops using Custom Aggregates","authors":"Surabhi Gupta, S. Purandare, Karthik Ramachandra","doi":"10.1145/3318464.3389736","DOIUrl":"https://doi.org/10.1145/3318464.3389736","url":null,"abstract":"Loops that iterate over SQL query results are quite common, both in application programs that run outside the DBMS, as well as User Defined Functions (UDFs) and stored procedures that run within the DBMS. It can be argued that set-oriented operations are more efficient and should be preferred over iteration; but from real world use cases, it is clear that loops over query results are inevitable in many situations, and are preferred by many users. Such loops, known as cursor loops, come with huge trade-offs and overheads w.r.t. performance, resource consumption and concurrency. We present Aggify, a technique for optimizing loops over query results that overcomes these overheads. It achieves this by automatically generating custom aggregates that are equivalent in semantics to the loop. Thereby, Aggify completely eliminates the loop by rewriting the query to use this generated aggregate. This technique has several advantages such as: (i) pipelining of entire cursor loop operations instead of materialization, (ii) pushing down loop computation from the application layer into the DBMS, closer to the data, (iii) leveraging existing work on optimization of aggregate functions, resulting in efficient query plans. We describe the technique underlying Aggify, and present our experimental evaluation over benchmarks as well as real workloads that demonstrate the significant benefits of this technique.","PeriodicalId":436122,"journal":{"name":"Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data","volume":"59 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129855846","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automating Exploratory Data Analysis via Machine Learning: An Overview","authors":"T. Milo, Amit Somech","doi":"10.1145/3318464.3383126","DOIUrl":"https://doi.org/10.1145/3318464.3383126","url":null,"abstract":"Exploratory Data Analysis (EDA) is an important initial step for any knowledge discovery process, in which data scientists interactively explore unfamiliar datasets by issuing a sequence of analysis operations (e.g., filter, aggregation, and visualization). Since EDA has long been known as a difficult task, requiring profound analytical skills, experience, and domain knowledge, a plethora of systems have been devised over the last decade in order to facilitate EDA. In particular, advancements in machine learning research have created exciting opportunities, not only for better facilitating EDA, but also for fully automating the process. In this tutorial, we review recent lines of work for automating EDA. We start from recommender systems for suggesting a single exploratory action, go through kNN-based classifiers and active-learning methods for predicting users' interestingness preferences, and finally cover fully automating EDA using state-of-the-art methods such as deep reinforcement learning and sequence-to-sequence models. We conclude the tutorial with a discussion on the main challenges and open questions to be dealt with in order to ultimately reduce the manual effort required for EDA.","PeriodicalId":436122,"journal":{"name":"Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128898973","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hub Labeling for Shortest Path Counting","authors":"Yikai Zhang, J. Yu","doi":"10.1145/3318464.3389737","DOIUrl":"https://doi.org/10.1145/3318464.3389737","url":null,"abstract":"The notion of shortest path is fundamental in graph analytics. While many works have been devoted to devising efficient distance oracles to compute the shortest distance between any vertices s and t, we study the problem of efficiently counting the number of shortest paths between s and t in light of its applications in tasks such as betweenness-related analysis. Specifically, we propose a hub labeling scheme based on hub pushing and discuss several graph reduction techniques to reduce the index size. Furthermore, we prove several theoretical results on the performance of the scheme for some special graph classes. Our empirical study verifies the efficiency and effectiveness of the algorithms. In particular, a query evaluation takes only hundreds of microseconds on average for graphs with up to hundreds of millions of edges. We report our findings in this paper.","PeriodicalId":436122,"journal":{"name":"Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data","volume":"66 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129002070","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Challenge of Building Effective, Enterprise-scale Data Lakes","authors":"Awez Syed","doi":"10.1145/3318464.3393816","DOIUrl":"https://doi.org/10.1145/3318464.3393816","url":null,"abstract":"There has been a rapid rise in the popularity of data lakes as the data infrastructure for modern analytics and data science. The combination of cloud storage and fast, elastic processing provides an inexpensive and scalable solution for building analytical applications. While data lakes make it easy to ingest and store vast amounts of data, the ability to effectively make use of that data is still limited. This data often lacks context, doesn't meet the quality required for applications, and is not easily understandable or discoverable by users. Problems of data consistency and accuracy make it hard to derive value from data lakes and to trust the analytics based on this data. The traditional methods of manually documenting, classifying and assessing the data don't scale to the volume of cloud-based data lakes. New automated, learning-based approaches are required to discover, curate and make the data usable for a wide variety of users. In this talk, we describe the real-world implementation patterns of data lakes and give an overview of the many open challenges in deploying successful, enterprise-scale data lakes.","PeriodicalId":436122,"journal":{"name":"Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data","volume":"98 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127810530","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Long-lived Transactions Made Less Harmful","authors":"Jong-Bin Kim, H. Cho, Kihwang Kim, Jaeseon Yu, Sooyong Kang, Hyungsoo Jung","doi":"10.1145/3318464.3389714","DOIUrl":"https://doi.org/10.1145/3318464.3389714","url":null,"abstract":"Many systems use snapshot isolation, or something similar, as defaults, and multi-version concurrency control (MVCC) remains essential to offering such point-in-time consistency. One major issue in MVCC is the timely removal of unnecessary versions of data items, especially in the presence of long-lived transactions (LLTs). We have observed that the latest versions of MySQL and PostgreSQL are still vulnerable to LLTs. Our analysis of existing proposals suggests that new solutions to this matter must provide rigorous rules for completely identifying unnecessary versions, and elaborate designs for version cleaning lest old versions required for LLTs should suspend garbage collection. In this paper, we formalize such rules into our version pruning theorem and version classification, all of which form the theoretical foundations for our new version management system, vDriver, that bases its record versioning on a new principle: Single In-row Remaining Off-row (SIRO) versioning. We implemented a prototype of vDriver and integrated it with MySQL-8.0 and PostgreSQL-12.0. The experimental evaluation demonstrated that the engines with vDriver continue to perform the reclamation of dead versions in the face of LLTs while retaining transaction throughput with reduced space consumption.","PeriodicalId":436122,"journal":{"name":"Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120983381","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Method for Optimizing Opaque Filter Queries","authors":"Wenjia He, Michael R. Anderson, M. Strome, Michael J. Cafarella","doi":"10.1145/3318464.3389766","DOIUrl":"https://doi.org/10.1145/3318464.3389766","url":null,"abstract":"An important class of database queries in machine learning and data science workloads is the opaque filter query: a query with a selection predicate that is implemented with a UDF, with semantics that are unknown to the query optimizer. Some typical examples would include a CNN-style trained image classifier, or a textual sentiment classifier. Because the optimizer does not know the predicate's semantics, it cannot employ standard optimizations, yielding long query times. We propose voodoo indexing, a two-phase method for optimizing opaque filter queries. Before any query arrives, the method builds a hierarchical \"query-independent\" index of the database contents, which groups together similar objects. At query-time, the method builds a map of how much each group satisfies the predicate, while also exploiting the map to accelerate execution. Unlike past methods, voodoo indexing does not require insight into predicate semantics, works on any data type, and does not require in-query model training. We describe both standalone and SparkSQL-specific implementations, plus experiments on both image and text data, on more than 100 distinct opaque predicates. We show voodoo indexing can yield up to an 88% improvement over standard scan behavior, and a 79% improvement over the previous best method adapted from research literature.","PeriodicalId":436122,"journal":{"name":"Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115111438","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}