Tsz Nam Chan, Rui Zang, Pak Lon Ip, Leong Hou U, Jianliang Xu
{"title":"PyNKDV: An Efficient Network Kernel Density Visualization Library for Geospatial Analytic Systems","authors":"Tsz Nam Chan, Rui Zang, Pak Lon Ip, Leong Hou U, Jianliang Xu","doi":"10.1145/3555041.3589711","DOIUrl":"https://doi.org/10.1145/3555041.3589711","url":null,"abstract":"Network kernel density visualization (NKDV) is an important tool for many application domains, including criminology and transportation science. However, all existing software tools, e.g., SANET (a plug-in for QGIS and ArcGIS) and spNetwork (an R package), adopt the naïve implementation of NKDV, which does not scale to large-scale location datasets and high-resolution sizes. To overcome this issue, we develop the first python library, called PyNKDV, which adopts our complexity-reduced solution and its parallel implementation to significantly improve the efficiency for generating NKDV. Moreover, PyNKDV is also user friendly (with four lines of python code) and can support commonly used geospatial analytic systems (e.g., QGIS and ArcGIS). In this demonstration, we will use three large-scale location datasets (up to 7.71 million data points), provide different python scripts (in the Jupyter Notebook), and install existing software tools (i.e., SANET and spNetwork) for participants to (1) explore different functionalities of our PyNKDV library and (2) compare its practical efficiency with existing software tools.","PeriodicalId":161812,"journal":{"name":"Companion of the 2023 International Conference on Management of Data","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123548524","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Tsz Nam Chan, Leong Hou U, Byron Choi, Jianliang Xu, R. Cheng
{"title":"Large-scale Geospatial Analytics: Problems, Challenges, and Opportunities","authors":"Tsz Nam Chan, Leong Hou U, Byron Choi, Jianliang Xu, R. Cheng","doi":"10.1145/3555041.3589401","DOIUrl":"https://doi.org/10.1145/3555041.3589401","url":null,"abstract":"Geospatial analytics is an important field in many communities, including crime science, transportation science, epidemiology, ecology, and urban planning. However, with the rapid growth of big geospatial data, most of the commonly used geospatial analytic tools are not efficient (or even feasible) to support large-scale datasets. As such, domain experts have raised the concerns about the inefficiency issues for using these tools. In this tutorial, we aim to arouse the attention of database researchers for this important, emerging, database-related, and interdisciplinary topic, which consists of four parts. In the first part, we will discuss different problems and highlight the challenges for two types of geospatial analytic tools, which are (1) hotspot detection and (2) correlation analysis. In the second and third parts, we will specifically discuss two geospatial analytic tools, namely kernel density visualization (the representative hotspot detection method) and K-function (the representative correlation analysis method), respectively, and their variants. In the fourth part, we will highlight the future opportunities for this topic.","PeriodicalId":161812,"journal":{"name":"Companion of the 2023 International Conference on Management of Data","volume":"134 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125130712","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Haneen Mohammed, Charlie Summers, Sughosh Kaushik, Eugene Wu
{"title":"SmokedDuck Demonstration: SQLStepper","authors":"Haneen Mohammed, Charlie Summers, Sughosh Kaushik, Eugene Wu","doi":"10.1145/3555041.3589731","DOIUrl":"https://doi.org/10.1145/3555041.3589731","url":null,"abstract":"Fine-grained lineage tracks the relationships between input and output of a query, and is particularly useful in analytical applications such as query debugging, view maintenance, query explanations, and data cleaning. Prior approaches rewrite SQL queries to also track lineage, but can slow query execution in analytical engines that are designed to process complex query patterns on large datasets. Moreover, they mainly capture lineage at the logical level. SmokedDuck extends DuckDB to support fast lineage capture and querying by tracking lineage at the instruction level by leveraging the duality between lineage and data movement. In this demonstration, we show how a user can leverage operator-level lineage to understand and debug a query execution through SQLStepper: an application built on top of SmokedDuck. Users upload data and execute queries using an in-browser command line, then explore query-level and operator-level lineage visually to track down bugs.","PeriodicalId":161812,"journal":{"name":"Companion of the 2023 International Conference on Management of Data","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125171125","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
G. Alonso, N. Ailamaki, S. Krishnamurthy, S. Madden, S. Sivasubramanian, R. Ramakrishnan
{"title":"Future of Database System Architectures","authors":"G. Alonso, N. Ailamaki, S. Krishnamurthy, S. Madden, S. Sivasubramanian, R. Ramakrishnan","doi":"10.1145/3555041.3589360","DOIUrl":"https://doi.org/10.1145/3555041.3589360","url":null,"abstract":"Over the past two decades, we have experienced major technology disruptions on multiple fronts, none bigger than the emergence of cloud computing, which has led to fundamental changes in how database software is architected. We are seeing several new trends that are similarly shaping the future of data management. With the demise of Moore's Law, we are now seeing a lot of interest (and start-ups with significant investments) in hardware database accelerators, exploring FPGAs, GPUs, and more. Economies of scale in the cloud make it possible to move to hardware many things that were done in software, the trend will continue and increase. Modern data estates are spread across data located on premises, on the edge and in one or more public clouds, spread across various sources like multiple relational databases, file and storage systems, and no-SQL systems, both operational and analytic. This phenomenon is referred to as data sprawl. We are also seeing the emergence of many novel data workloads. For example, rich data pipelines are an increasingly common workload. And finally, Machine Learning is having a rapidly increasing role in every aspect of the database software lifecycle. This SIGMOD panel will discuss the impact of the above changes and trends on database hardware and software architectures. How will these changes impact DB system design, how will DB systems look like in the near future? Where are the hardest research challenges? What learnings from the past will guide us through these disruptions?","PeriodicalId":161812,"journal":{"name":"Companion of the 2023 International Conference on Management of Data","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115437289","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Demonstration of KAMEL: A Scalable BERT-based System for Trajectory Imputation","authors":"Mashaal Musleh, M. Mokbel","doi":"10.1145/3555041.3589733","DOIUrl":"https://doi.org/10.1145/3555041.3589733","url":null,"abstract":"This demo presents KAMEL; a novel trajectory imputation framework that aims to impute sparse trajectories as a means of increasing their accuracy, and hence the accuracy of their applications. Unlike the large majority of current trajectory imputation techniques, KAMEL does not require the knowledge or the availability of the underlying road network, which makes it applicable to important applications like map inference that need to infer the road network itself. Audience will experience KAMEL through various scenarios that show the imputation accuracy as well as KAMEL internals.","PeriodicalId":161812,"journal":{"name":"Companion of the 2023 International Conference on Management of Data","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129094308","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Demonstrating NaturalMiner: Searching Large Data Sets for Abstract Patterns Described in Natural Language","authors":"Immanuel Trummer","doi":"10.1145/3555041.3589694","DOIUrl":"https://doi.org/10.1145/3555041.3589694","url":null,"abstract":"The NaturalMiner system seeks to extract facts from large relational data sets that match abstract patterns defined in natural language. For instance, this enables users to search, with regards to a specific airline, for evidence that \"the airline underperforms\" or \"the airline outperforms'' within a data set containing flight statistics, hinting at areas for improvements or strengths to advertise. Internally, NaturalMiner iteratively generates statistical facts from data by processing SQL queries, selecting facts to generate by a reinforcement learning approach. It uses pre-trained language models to score candidate facts with regards to user-specified search patterns, returning the fact combination with maximal score after a user-specified time budget. To deal with large data sets, NaturalMiner features customized caching and sampling strategies. The proposed demonstration will showcase search for different patterns described in natural language, covering different data sets and scenarios.","PeriodicalId":161812,"journal":{"name":"Companion of the 2023 International Conference on Management of Data","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125941364","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Matthias Boehm, Matteo Interlandi, Christopher M. Jermaine
{"title":"Optimizing Tensor Computations: From Applications to Compilation and Runtime Techniques","authors":"Matthias Boehm, Matteo Interlandi, Christopher M. Jermaine","doi":"10.1145/3555041.3589407","DOIUrl":"https://doi.org/10.1145/3555041.3589407","url":null,"abstract":"Machine learning (ML) training and scoring fundamentally relies on linear algebra programs and more general tensor computations. Most ML systems utilize distributed parameter servers and similar distribution strategies for mini-batch stochastic gradient descent training. However, many more tasks in the data science and engineering lifecycle can benefit from efficient tensor computations. Examples include primitives for data cleaning, data and model debugging, data augmentation, query processing, numerical simulations, as well as a wide variety of training and scoring algorithms. In this survey tutorial, we first make a case for the importance of optimizing more general tensor computations, and then provide an in-depth survey of existing applications, optimizing compilation techniques, and underlying runtime strategies. Interestingly, there are close connections to data-intensive applications, query rewriting and optimization, as well as query processing and physical design. Our goal for the tutorial is to structure existing work, create common terminology, and identify open research challenges.","PeriodicalId":161812,"journal":{"name":"Companion of the 2023 International Conference on Management of Data","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130860110","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
X. Dong, Bo Li, Julia Stoyanovich, A. Tung, G. Weikum, A. Halevy, Wang-Chiew Tan
{"title":"Personal Data for Personal Use: Vision or Reality?","authors":"X. Dong, Bo Li, Julia Stoyanovich, A. Tung, G. Weikum, A. Halevy, Wang-Chiew Tan","doi":"10.1145/3555041.3589378","DOIUrl":"https://doi.org/10.1145/3555041.3589378","url":null,"abstract":"The vision of collecting all of one's personal information into one searchable database has been around at least since Vannevar Bush's 1945 paper on the Memex System [2]. In the late 1990's, Gordon Bell and his colleagues at Microsoft Research built MyLifeBits [1, 6], which was the first serious attempt to build such a database. Since then, there has been continued interest in our community to build personal information management systems [3-5, 7, 8, 10]. Recently, the Solid Project proposes a more radical approach to personal information, arguing that all of one's data should reside in their own data pod, and applications should be redesigned to fetch data from the pod [9].","PeriodicalId":161812,"journal":{"name":"Companion of the 2023 International Conference on Management of Data","volume":"108 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132102509","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fairness in Ranking: From Values to Technical Choices and Back","authors":"Julia Stoyanovich, Meike Zehlike, Ke Yang","doi":"10.1145/3555041.3589405","DOIUrl":"https://doi.org/10.1145/3555041.3589405","url":null,"abstract":"In the past few years, there has been much work on incorporating fairness requirements into the design of algorithmic rankers, with contributions from the data management, algorithms, information retrieval, and recommender systems communities. In this tutorial, we give a systematic overview of this work, offering a broad perspective that connects formalizations and algorithmic approaches across subfields. During the first part of the tutorial, we present a classification framework for fairness-enhancing interventions, along which we will then relate the technical methods. This framework allows us to unify the presentation of mitigation objectives and of algorithmic techniques to help meet those objectives or identify trade-offs. Next, we discuss fairness in score-based ranking and in supervised learning-to-rank. We conclude with recommendations for practitioners, to help them select a fair ranking method based on the requirements of their specific application domain.","PeriodicalId":161812,"journal":{"name":"Companion of the 2023 International Conference on Management of Data","volume":"33 4","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132899002","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SparkSQL+: Next-generation Query Planning over Spark","authors":"Binyang Dai, Qichen Wang, K. Yi","doi":"10.1145/3555041.3589715","DOIUrl":"https://doi.org/10.1145/3555041.3589715","url":null,"abstract":"We will demonstrate SparkSQL+, a SQL processing engine built on top of Spark. Unlike the vanilla SparkSQL that uses classical query plans, SparkSQL+ adopts some of the recently developed new query plans, including generalized hypertree decompositions(GHD), worst-case optimal join (WCOJ) algorithms, and conjunctive queries with comparisons (CQC). SparkSQL+ also provides a platform for users to explore different query plans for a given query through a web-based interface, and compare their performance with classical query plans on the same Spark core.","PeriodicalId":161812,"journal":{"name":"Companion of the 2023 International Conference on Management of Data","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129889489","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}