{"title":"What is special about spatial data science and Geo-AI?","authors":"S. Shekhar","doi":"10.1145/3468791.3472263","DOIUrl":"https://doi.org/10.1145/3468791.3472263","url":null,"abstract":"The importance of spatial data science and Geo-AI is growing with the rise of spatial and spatiotemporal big data (e.g., trajectories, remote-sensing images, census and geo-social media) [1-2]. Societal use cases include Agriculture (global crop monitoring, precision agriculture), Location-based services (e.g., navigation, ride-sharing), Public Health (e.g., monitoring disease spread), Environment and Climate (change detection, land-cover classification), Smart Cities (e.g., mapping buildings), etc. [1-2]. Classical data science and AI (e.g., machine learning) often perform poorly when applied to spatial data sets, for several reasons [1-5]. First, spatial data is embedded in a continuous space and classical statistics (e.g., correlation) are not robust to the modifiable areal unit problem. Second, spatial data-items have extended footprints (e.g., line strings, polygons) and implicit relationships (e.g., distance, touch). Third, the high cost of spurious patterns requires guardrails (e.g., statistical significance tests) to reduce false positives. Furthermore, spatial autocorrelation and variability violate the classical assumption that data samples are generated independently from identical distributions, which risks producing models that are either inaccurate or inconsistent with the data. Thus, new methods are needed to analyze spatial data [1-5]. 
This talk surveys common and emerging methods for spatial classification and prediction (e.g., spatial autoregression, spatial decision trees [6], spatial variability aware neural networks [7]), as well as techniques for discovering interesting, useful and non-trivial patterns such as hotspots (e.g., circular, linear, arbitrary shapes [8]), interactions (e.g., co-locations [9], tele-connections), spatial outliers [10], and their spatio-temporal counterparts [3].","PeriodicalId":312773,"journal":{"name":"33rd International Conference on Scientific and Statistical Database Management","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116648806","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Caching Support for Range Query Processing on Bitmap Indices","authors":"Sarah McClain, Manya Mutschler-Aldine, C. Monaghan, David Chiu, Jason Sawin, Patrick Jarvis","doi":"10.1145/3468791.3468800","DOIUrl":"https://doi.org/10.1145/3468791.3468800","url":null,"abstract":"Bitmaps are commonly used for indexing read-mostly data sets. The range of an attribute is split into bins, where its values are placed: b_ij = 1 denotes that the value of the i-th tuple falls in the j-th bin, and b_ij = 0 otherwise. A number of query types can be decomposed into the systematic application of boolean operators over sets of bins. However, when bitmaps are high-dimensional, the overall query-processing performance can deteriorate due to the increased number of bins that participate per query. We propose a caching framework that organizes, manages, and integrates cached partial results to accelerate query processing on high-dimensional bitmaps. We begin by showing that, to resolve general complex disjunctive and conjunctive queries, the selection of an optimal set of partial bitmap results is NP-complete. Restricting this problem to consecutive bin sequences (characteristic of common range and point queries) allows us to solve it efficiently. 
We present an evaluation of our caching system over several workloads on the TPC-H benchmark and a real network-intrusion data set.","PeriodicalId":312773,"journal":{"name":"33rd International Conference on Scientific and Statistical Database Management","volume":"80 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114281925","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MISE: An Array-Based Integrated System for Atmospheric Scanning LiDAR","authors":"Kyoseung Koo, Juhun Kim, Bongki Moon","doi":"10.1145/3468791.3468829","DOIUrl":"https://doi.org/10.1145/3468791.3468829","url":null,"abstract":"Researchers face two problems when building a data processing pipeline for atmospheric scanning LiDAR. First, they must build an entire system that handles collecting signals, processing data, and visualizing the results. Second, they must support fast data processing to expand and deploy their system. In this paper, we introduce MISE, a fast integrated system that handles atmospheric scanning LiDAR data. MISE provides end-to-end processing, configuration options, and predefined signal-processing methods. In addition, the system uses an efficient chunking approach for fast processing with an array database. We demonstrate the construction and operation of a fine-dust particle monitoring system (based on a real-world scenario) using MISE. The demonstration shows the usability and fast performance of MISE.","PeriodicalId":312773,"journal":{"name":"33rd International Conference on Scientific and Statistical Database Management","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128518841","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MAMBO - Indexing Dead Space to Accelerate Spatial Queries","authors":"Giannis Evagorou, T. Heinis","doi":"10.1145/3468791.3468804","DOIUrl":"https://doi.org/10.1145/3468791.3468804","url":null,"abstract":"With the increasing size and prevalence of spatial data across applications, efficiently indexing it becomes key. Minimum bounding boxes (MBBs), axis-aligned rectangles that minimally enclose an object, have become crucial for spatial indexes as approximations of complex geometric objects. MBBs succinctly summarize complex spatial objects and thus allow for an efficient filtering stage thanks to faster intersection tests. However, they introduce dead space, i.e., space that is indexed but contains no spatial objects. Querying dead space returns no results but still reads data from disk, slowing down query execution unnecessarily. In this paper, we propose MaMBo (Meshed MBb), a grid-based data structure to index dead space in addition to an index of the spatial objects. We augment intersection operations of established indexes to consult our data structure while executing queries, thereby avoiding retrieval of unnecessary data from disk, i.e., data which only contains dead space. 
As our experiments show, we can significantly reduce I/O — the major overhead for disk-resident datasets — by over 50% when using MaMBo with an R-Tree.","PeriodicalId":312773,"journal":{"name":"33rd International Conference on Scientific and Statistical Database Management","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124138236","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ArrayQL for Linear Algebra within Umbra","authors":"Maximilian E. Schüle, T. Götz, A. Kemper, Thomas Neumann","doi":"10.1145/3468791.3468838","DOIUrl":"https://doi.org/10.1145/3468791.3468838","url":null,"abstract":"Array database systems offer a declarative language for array-based access on multidimensional data. This study explains the integration of ArrayQL inside a relational database system, either addressable through a separate query interface or integrated into SQL as user-defined functions. With a relational database system as the target, we inherit benefits such as query optimisation and multi-version concurrency control by design. Apart from SQL, having another query language allows processing the data without extracting it from or transforming it out of its relational form. This is possible as we work on a relational array representation, for which we translate each ArrayQL operator into relational algebra. In our evaluation, ArrayQL within Umbra computes matrix operations faster than state-of-the-art database extensions.","PeriodicalId":312773,"journal":{"name":"33rd International Conference on Scientific and Statistical Database Management","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123983508","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Distributed Enumeration of Four Node Graphlets at Quadrillion-Scale","authors":"Xiaozhou Liu, Yudi Santoso, Venkatesh Srinivasan, Alex Thomo","doi":"10.1145/3468791.3468805","DOIUrl":"https://doi.org/10.1145/3468791.3468805","url":null,"abstract":"Graphlet enumeration is a basic task in graph analysis with many applications. Thus it is important to be able to perform this task within a reasonable amount of time. However, this objective is challenging when the input graph is very large, with millions of nodes and edges. Known solutions are limited in terms of scalability. Distributed computing is often proposed as a solution to improve scalability. However, it has to be done carefully to reduce the overhead cost and to really benefit from the distributed solution. We study the enumeration of four-node graphlets in undirected graphs using a distributed platform. We propose an efficient distributed solution which significantly surpasses the existing solutions. With this method we are able to process larger graphs that have never been processed before and enumerate quadrillions of graphlets using a modest cluster of machines. We show the scalability of our solution through experimental results. Finally, we also extend our algorithm to enumerate graphlets in probabilistic graphs and demonstrate its suitability for this case.","PeriodicalId":312773,"journal":{"name":"33rd International Conference on Scientific and Statistical Database Management","volume":"2018 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114660963","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Online Landmark-Based Batch Processing of Shortest Path Queries","authors":"Manuel Hotz, Theodoros Chondrogiannis, Leonard Wörteler, Michael Grossniklaus","doi":"10.1145/3468791.3468844","DOIUrl":"https://doi.org/10.1145/3468791.3468844","url":null,"abstract":"Processing shortest path queries is a basic operation in many graph problems. Both preprocessing-based and batch processing techniques have been proposed to speed up the computation of a single shortest path by amortizing its costs. However, both of these approaches suffer from limitations. The former techniques are prohibitively expensive in situations where the precomputed information needs to be updated frequently due to changes in the graph, while the latter require coordinates and cannot be used on non-spatial graphs. In this paper, we address both limitations and propose novel techniques for batch processing shortest path queries using landmarks. We show how preprocessing can be avoided entirely by integrating the computation of landmark distances into query processing. Our experimental results demonstrate that our techniques outperform the state of the art on both spatial and non-spatial graphs with a maximum speedup of 3.61× in online scenarios.","PeriodicalId":312773,"journal":{"name":"33rd International Conference on Scientific and Statistical Database Management","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117114035","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DJEnsemble: a Cost-Based Selection and Allocation of a Disjoint Ensemble of Spatio-temporal Models","authors":"R. S. Pereira, Y. M. Souto, A. Silva, Rocio Zorilla, Brian Tsan, Florin Rusu, Eduardo S. Ogasawara, A. Ziviani, F. Porto","doi":"10.1145/3468791.3468806","DOIUrl":"https://doi.org/10.1145/3468791.3468806","url":null,"abstract":"Consider a set of black-box models – each of them independently trained on a different dataset – answering the same predictive spatio-temporal query. Being built in isolation, each model traverses its own life-cycle until it is deployed to production, learning data patterns from different datasets and facing independent hyper-parameter tuning. In order to answer the query, the set of black-box predictors has to be ensembled and allocated to the spatio-temporal query region. However, computing an optimal ensemble is a complex task that involves selecting the appropriate models and defining an effective allocation strategy that maps the models to the query region. In this paper we present DJEnsemble, a cost-based strategy for the automatic selection and allocation of a disjoint ensemble of black-box predictors to answer predictive spatio-temporal queries. We conduct a set of extensive experiments that evaluate DJEnsemble and highlight its efficiency, selecting model ensembles that are almost as efficient as the optimal solution. 
When compared against the traditional ensemble approach, DJEnsemble achieves up to 4X improvement in execution time and almost 9X improvement in prediction accuracy.","PeriodicalId":312773,"journal":{"name":"33rd International Conference on Scientific and Statistical Database Management","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124814726","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"In-Database Machine Learning with SQL on GPUs","authors":"Maximilian E. Schüle, Harald Lang, M. Springer, A. Kemper, Thomas Neumann, Stephan Günnemann","doi":"10.1145/3468791.3468840","DOIUrl":"https://doi.org/10.1145/3468791.3468840","url":null,"abstract":"In machine learning, continuously retraining a model guarantees accurate predictions based on the latest data as training input. But to retrieve the latest data from a database, time-consuming extraction is necessary as database systems have rarely been used for operations such as matrix algebra and gradient descent. In this work, we demonstrate that SQL with recursive tables makes it possible to express a complete machine learning pipeline consisting of data preprocessing, model training, and validation. To facilitate the specification of loss functions, we extend the code-generating database system Umbra with an operator for automatic differentiation for use within recursive tables: with the loss function expressed in SQL as a lambda function, Umbra generates machine code for each partial derivative. We further use automatic differentiation for a dedicated gradient descent operator, which generates LLVM code to train a user-specified model on GPUs. We fine-tune GPU kernels at hardware level to allow a higher throughput and propose non-blocking synchronisation of multiple units. In our evaluation, automatic differentiation accelerated the runtime by the number of cached subexpressions compared to compiling each derivative separately. 
Our GPU kernels with independent models allowed maximal throughput even for small batch sizes, making machine learning pipelines within SQL more competitive.","PeriodicalId":312773,"journal":{"name":"33rd International Conference on Scientific and Statistical Database Management","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126427023","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automatic Selection of Analytic Platforms with ASAP-DM","authors":"M. Fritz, Gang Shao, H. Schwarz","doi":"10.1145/3468791.3468802","DOIUrl":"https://doi.org/10.1145/3468791.3468802","url":null,"abstract":"The plethora of available analytic platforms makes it increasingly difficult to select the most appropriate platform for a given data mining task and for datasets with varying characteristics. Novice analysts in particular have difficulty keeping up with the latest technical developments. In this demo, we present the ASAP-DM framework. ASAP-DM is able to automatically select a well-performing analytic platform for a given data mining task via an intuitive web interface, thus especially supporting novice analysts. The take-aways for demo attendees are: (1) a good understanding of the challenges of various data mining workloads, dataset characteristics, and their effects on the selection of analytic platforms, (2) useful insights into how ASAP-DM works internally, and (3) how to benefit from ASAP-DM for exploratory data analysis.","PeriodicalId":312773,"journal":{"name":"33rd International Conference on Scientific and Statistical Database Management","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115565054","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}