{"title":"Cohesive Subgraph Detection in Large Bipartite Networks","authors":"Y. Hao, Mengqi Zhang, Xiaoyang Wang, Chen Chen","doi":"10.1145/3400903.3400925","DOIUrl":"https://doi.org/10.1145/3400903.3400925","url":null,"abstract":"In real-world applications, bipartite graphs are widely used to model the relationships between two types of entities, such as customer-product relationships, gene co-expression, etc. As a fundamental problem, cohesive subgraph detection is of great importance for bipartite graph analysis. In this paper, we propose a novel cohesive subgraph model, named (α, β, ω)-core, which requires each node to have a sufficient number of close neighbors. The model emphasizes both the engagement of entities and the strength of connections. To scale to large networks, an efficient algorithm is developed to compute the (α, β, ω)-core. We conduct experiments over real-world bipartite graphs to verify the advantages of the proposed model and techniques compared with existing cohesive subgraph models.","PeriodicalId":334018,"journal":{"name":"32nd International Conference on Scientific and Statistical Database Management","volume":"67 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124153155","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
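The (α, β, ω)-core above builds on the classic (α, β)-core, in which every remaining node on the upper side keeps at least α neighbours and every node on the lower side at least β. As a rough illustration of the peeling idea only (the ω edge-strength constraint from the paper is omitted, and all names here are hypothetical), a minimal sketch:

```python
from collections import deque

def alpha_beta_core(upper_adj, lower_adj, alpha, beta):
    """Peel a bipartite graph down to its (alpha, beta)-core:
    every surviving upper node keeps >= alpha neighbours and
    every surviving lower node keeps >= beta neighbours."""
    removed_u = {u for u, nbrs in upper_adj.items() if len(nbrs) < alpha}
    removed_l = {v for v, nbrs in lower_adj.items() if len(nbrs) < beta}
    queue = deque([('u', u) for u in removed_u] + [('l', v) for v in removed_l])
    deg_u = {u: len(n) for u, n in upper_adj.items()}
    deg_l = {v: len(n) for v, n in lower_adj.items()}
    while queue:                      # cascade removals until stable
        side, node = queue.popleft()
        if side == 'u':
            for v in upper_adj[node]:
                if v not in removed_l:
                    deg_l[v] -= 1
                    if deg_l[v] < beta:
                        removed_l.add(v)
                        queue.append(('l', v))
        else:
            for u in lower_adj[node]:
                if u not in removed_u:
                    deg_u[u] -= 1
                    if deg_u[u] < alpha:
                        removed_u.add(u)
                        queue.append(('u', u))
    return set(upper_adj) - removed_u, set(lower_adj) - removed_l
```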
{"title":"Transit-based Task Assignment in Spatial Crowdsourcing","authors":"S. Gummidi, T. Pedersen, Xike Xie","doi":"10.1145/3400903.3400929","DOIUrl":"https://doi.org/10.1145/3400903.3400929","url":null,"abstract":"Worker movement information can help the spatial crowdsourcing platform to identify the right time to assign a task to a worker for successful completion of the task. However, the majority of current assignment strategies do not consider worker movement information. This paper aims to utilize worker movement information via transits in an online task assignment setting. The idea is to harness the waiting periods at different transit stops in a worker transit route (WTR) for performing tasks. Given the limited availability of workers’ waiting periods at transit stops, task deadlines, and workers’ preference for performing tasks with higher rewards, we define the Transit-based Task Assignment (TTA) problem. The objective of the TTA problem is to maximize the average worker reward, considering fixed worker transit models, in order to motivate workers. We solve the TTA problem by considering three variants, step by step, from offline to batch-based online versions. The first variant is the offline version of the TTA, which can be reduced to a maximum bipartite matching problem and leveraged for the second variant. The second variant is the batch-based online version of the TTA, for which we propose treating each batch as an offline TTA instance, along with additional credibility constraints to ensure a certain level of worker response quality. The third variant (Flexible-TTA) extends the batch-based online version of the TTA by relaxing the strict nature of the WTR model and assuming that a task with a higher reward than a worker-defined threshold value will convince the worker to stay longer at the transit stop. Through our extensive evaluation, we observe that the algorithm solving the Flexible-TTA problem outperforms the algorithms for the other TTA variants by 55% in terms of the number of assigned tasks and by at least 35% in terms of average worker reward. With respect to the baseline (online task assignment) algorithm, the algorithm solving the Flexible-TTA problem yields three times higher reward and runs at least three times faster.","PeriodicalId":334018,"journal":{"name":"32nd International Conference on Scientific and Statistical Database Management","volume":"316 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116236919","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
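The abstract notes that the offline TTA variant reduces to maximum bipartite matching between worker waiting slots and tasks. A minimal augmenting-path matching sketch (this is the textbook reduction target, not the paper's implementation; all names are hypothetical):

```python
def max_bipartite_matching(edges, workers):
    """Classic augmenting-path maximum matching. `edges` maps each
    worker slot to the tasks it can feasibly perform in that slot."""
    match_of_task = {}  # task -> worker currently assigned to it

    def try_assign(w, seen):
        for t in edges.get(w, []):
            if t in seen:
                continue
            seen.add(t)
            # Task is free, or its current worker can be re-routed.
            if t not in match_of_task or try_assign(match_of_task[t], seen):
                match_of_task[t] = w
                return True
        return False

    matched = sum(try_assign(w, set()) for w in workers)
    return matched, match_of_task
```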
{"title":"We Know What You Did Last Session: Policy-Based Query Classification for Data-Privacy Compliance With the DataEconomist","authors":"Peter K. Schwab, Maximilian S. Langohr, K. Meyer-Wegener","doi":"10.1145/3400903.3401692","DOIUrl":"https://doi.org/10.1145/3400903.3401692","url":null,"abstract":"This paper presents a demonstration of the DataEconomist, a framework for policy-based SQL query classification according to data-privacy directives. Our framework automatically derives query meta-information based on query-log analysis and provides user-friendly, graphical interfaces for browsing and filtering queries based on this meta-information. We aim to complement existing data-privacy approaches and enable privacy officers to define domain-specific compliance policy rules based on the graphical filter mechanisms. Policies automatically classify queries as compliant or non-compliant regarding their processing of personal data. During our demonstration, conference attendees can assess our system in several scenarios. They filter queries based on various query meta-information, learn how to define compliance policies for automatic query classification without profound technical knowledge, and test this classification by formulating non-compliant queries.","PeriodicalId":334018,"journal":{"name":"32nd International Conference on Scientific and Statistical Database Management","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129685976","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Arrays in Databases: from Underdog to First-Class Citizen","authors":"P. Baumann","doi":"10.1145/3400903.3409500","DOIUrl":"https://doi.org/10.1145/3400903.3409500","url":null,"abstract":"Array Databases close a gap in the database ecosystem by adding modelling, storage, and processing support for multi-dimensional arrays. Such structures have long been known in OLAP and statistics as “datacubes”, but they also appear as spatio-temporal sensor, image, simulation, and statistics data in all science and engineering domains. In our research we address Array Databases in all aspects, from concepts over architecture to applications. Our full-stack implementation rasdaman (\"raster data manager\"), which has effectively pioneered Array Databases, is in operational use on multi-Petabyte, federated Earth data assets. Based on this experience, the rasdaman team has initiated and shaped datacube standards such as ISO SQL/MDA (Multi-Dimensional Arrays) and the OGC Earth datacube standards suite. In our talk we present the concepts and implementation of rasdaman and show its application to Earth datacubes, illustrated by a live demonstration of operational services.","PeriodicalId":334018,"journal":{"name":"32nd International Conference on Scientific and Statistical Database Management","volume":"133 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117057177","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Top-k String Similarity Joins","authors":"Shuyao Qi, Panagiotis Bouros, N. Mamoulis","doi":"10.1145/3400903.3400922","DOIUrl":"https://doi.org/10.1145/3400903.3400922","url":null,"abstract":"Top-k joins have been extensively studied in relational databases as ranking operations where every object has, among others, at least one ranking attribute. However, the focus has mostly been on the case where the join attributes are of primitive data types (e.g., numerical values) and the join predicate is equality. In this work, we consider string objects assigned such ranking attributes or simply scores. Given two collections of string objects and a string similarity measure (e.g., the Edit distance), we introduce the top-k string similarity join, which returns the k sufficiently similar pairs of objects, with respect to a similarity threshold ϵ, that have the highest combined score computed by a monotone aggregate function γ (e.g., SUM). Such a join operation finds application in data integration, data cleaning and de-duplication scenarios, and in emerging scientific fields such as bioinformatics. We investigate how existing top-k join methods can be adapted and optimized for this operation, taking into account the semantics and the special characteristics of string similarity joins. We present techniques that avoid computing the entire string join, and indexing that enables pruning candidates with respect to both the string join and the ranking component of the query. Our extensive experimental analysis demonstrates the efficiency of our methodology by comparing solutions that either prioritize the ranking/join component or are able to handle both components of the query at the same time.","PeriodicalId":334018,"journal":{"name":"32nd International Conference on Scientific and Statistical Database Management","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123793083","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
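As a point of reference for the query semantics described above, a naive baseline materialises every pair, filters by edit-distance threshold ϵ, and ranks by the aggregate γ; the paper's contribution is precisely avoiding this full join via pruning, so this sketch only fixes the semantics (all names are hypothetical):

```python
from heapq import nlargest
from itertools import product

def edit_distance(a, b):
    """Standard dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[-1] + 1,            # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def topk_string_sim_join(R, S, eps, k, gamma=lambda x, y: x + y):
    """Naive baseline: keep pairs within edit distance eps, ranked
    by the monotone aggregate gamma (SUM by default). R and S are
    lists of (string, score) pairs."""
    candidates = (((r, s), gamma(score_r, score_s))
                  for (r, score_r), (s, score_s) in product(R, S)
                  if edit_distance(r, s) <= eps)
    return nlargest(k, candidates, key=lambda pair: pair[1])
```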
{"title":"Vectorising k-Core Decomposition for GPU Acceleration","authors":"Amir Mehrafsa, S. Chester, Alex Thomo","doi":"10.1145/3400903.3400931","DOIUrl":"https://doi.org/10.1145/3400903.3400931","url":null,"abstract":"k-Core decomposition is a well-studied community detection problem in graph analytics in which the k-core is the maximal subgraph where every vertex has degree at least k. The decomposition is expensive to compute on large graphs, and efforts to apply massive parallelism have had limited success. This paper presents a vectorisation of the problem that reframes it as a composition of vector primitives on flat, 1d arrays. With such a formulation, we can deploy highly optimised Deep Learning GPU and SIMD frameworks. On a moderate GPU, using PyTorch, we obtain up to 8× improvement over the best parallel state of the art, implemented in C++ and running on an expensive 32-core machine. More importantly, our approach represents a novel abstraction, showing that redesigning graph operations as a series of vectorised primitives makes highly parallel analytics both easier and more accessible for developers. We posit that such an approach can vastly accelerate the use of cheap GPU hardware in complex graph analytics.","PeriodicalId":334018,"journal":{"name":"32nd International Conference on Scientific and Statistical Database Management","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133107453","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
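For contrast with the vectorised formulation, the classic scalar peeling algorithm that the paper reframes as vector primitives can be sketched as follows (a plain reference implementation, not the paper's PyTorch code; names are hypothetical):

```python
def core_numbers(adj):
    """Peeling-based k-core decomposition: repeatedly remove a
    minimum-degree vertex; the running maximum of removal degrees
    is the core number assigned to each vertex."""
    deg = {v: len(nbrs) for v, nbrs in adj.items()}
    core, k = {}, 0
    remaining = set(adj)
    while remaining:
        v = min(remaining, key=deg.get)  # O(n) scan; fine for a sketch
        k = max(k, deg[v])
        core[v] = k
        remaining.remove(v)
        for u in adj[v]:                 # peel: lower neighbours' degrees
            if u in remaining:
                deg[u] -= 1
    return core
```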
{"title":"Orderings of Data - More Than a Tripping Hazard: Visionary","authors":"A. Beer, Valentin Hartmann, T. Seidl","doi":"10.1145/3400903.3400911","DOIUrl":"https://doi.org/10.1145/3400903.3400911","url":null,"abstract":"As data processing techniques get more and more sophisticated every day, many of us researchers often get lost in the details and subtleties of the algorithms we are developing and far too easily seem to forget to look also at the very first steps of every algorithm: the input of the data. Since there are plenty of library functions for this task, we indeed do not have to think about this part of the pipeline anymore. But maybe we should. All data is stored and loaded into a program in some order. In this vision paper we study how ignoring this order can (1) lead to performance issues and (2) make research results unreproducible. We furthermore examine desirable properties of a data ordering and why current approaches are often not suited to tackle the two mentioned problems.","PeriodicalId":334018,"journal":{"name":"32nd International Conference on Scientific and Statistical Database Management","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130531568","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Accessible Streaming Algorithms for the Chi-Square Test","authors":"Emily Farrow, Junbo Li, Farhan Zaki, Ashwin Lall","doi":"10.1145/3400903.3400905","DOIUrl":"https://doi.org/10.1145/3400903.3400905","url":null,"abstract":"We present space-efficient algorithms for performing Pearson’s chi-square goodness-of-fit test in a streaming setting. Since the chi-square test is one of the most well-known and commonly used tests in statistics, it is surprising that there has been no prior work on designing streaming algorithms for it. The test is not based on a specific distribution assumption and has one-sample and two-sample variants. Given a stream of data, the one-sample variant tests if the stream is drawn from a fixed distribution. The two-sample variant tests if two data streams are drawn from the same or similar distributions. One major advantage of using statistical tests over other quantities commonly measured by streaming algorithms is that these tests do not require parameter tuning and have results that can be easily interpreted by data analysts. The problem that we solve in this paper is how to compute the chi-square test on streams with minimal parameter configuration and assumptions. We give rigorous proofs showing that it is possible to compute the chi-square statistic with high fidelity and an almost quadratic reduction in memory in the continuous case, but the categorical case only admits heuristic solutions. We validate the performance and accuracy of our algorithms through extensive testing on both real and synthetic data sets.","PeriodicalId":334018,"journal":{"name":"32nd International Conference on Scientific and Statistical Database Management","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114833297","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
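For the one-sample variant over a categorical stream with known bins, the exact statistic can be maintained with one counter per bin; the paper's space-efficient algorithms aim below this naive baseline. A sketch of the exact counter-based computation (hypothetical names, not the paper's algorithm):

```python
class StreamingChiSquare:
    """One-sample chi-square over a categorical stream: one counter
    per bin, so memory is O(#bins) regardless of stream length."""

    def __init__(self, expected_probs):
        self.p = expected_probs                    # bin -> expected probability
        self.counts = {b: 0 for b in expected_probs}
        self.n = 0

    def update(self, item):
        """Process one stream element (must be a known bin)."""
        self.counts[item] += 1
        self.n += 1

    def statistic(self):
        """Sum over bins of (observed - expected)^2 / expected."""
        return sum((self.counts[b] - self.n * pb) ** 2 / (self.n * pb)
                   for b, pb in self.p.items())
```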
{"title":"DocDesign: Cost-Based Database Design for Document Stores","authors":"Moditha Hewasinghage, A. Abelló, Jovan Varga, E. Zimányi","doi":"10.1145/3400903.3401689","DOIUrl":"https://doi.org/10.1145/3400903.3401689","url":null,"abstract":"Document stores have become one of the most popular NoSQL systems, mainly due to their semi-structured data storage structure and well-developed query capabilities. The semi-structured nature allows them to have database designs beyond traditional normalization theories. This makes database design decisions more complicated, with a myriad of possibilities, and the database design process has thus resorted to ad-hoc trial-and-error methods. However, having a good database design is essential for any data storage system’s performance, and bad design decisions cannot always be compensated by adding more powerful hardware. In this work, we propose DocDesign, a decision aid tool for document store database design. DocDesign allows its users to evaluate different database designs for data storage requirements under a particular workload. Through DocDesign, users can make informed decisions about a design by evaluating estimated storage statistics and query runtimes without testing it on an actual document store. DocDesign also generates design-specific queries for the input workload. This not only cuts down the time and effort spent on design decision-making and development but also saves money spent on fixing poor designs in the long run. On-site, we will showcase how DocDesign facilitates the design decision-making process for MongoDB with both synthetic and real-world examples.","PeriodicalId":334018,"journal":{"name":"32nd International Conference on Scientific and Statistical Database Management","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125297731","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Shared Execution Techniques for Business Data Analytics over Big Data Streams","authors":"Serkan Uzunbaz, Walid G. Aref","doi":"10.1145/3400903.3400932","DOIUrl":"https://doi.org/10.1145/3400903.3400932","url":null,"abstract":"Business data analytics requires processing large numbers of data streams and creating materialized views in order to provide near-real-time answers to user queries. Materializing the view of each query and refreshing it continuously as a separate query execution plan is neither efficient nor scalable. In this paper, we present a global query execution plan to simultaneously support multiple queries and minimize the number of input scans, operators, and tuples flowing between the operators. We propose shared-execution techniques for creating and maintaining materialized views in support of business data analytics queries. We utilize commonalities across multiple business data analytics queries to support scalable and efficient processing of big data streams. The paper highlights shared-execution techniques for select predicates, grouping, and aggregate calculations. We present how global query execution plans are run in a distributed stream processing system, called INGA, which is built on top of Storm. In INGA, we are able to support online maintenance of 2500 materialized views for 237 queries by utilizing the shared constructs between the queries. We are able to run all 237 queries using a single global query execution plan tree with a depth of 21.","PeriodicalId":334018,"journal":{"name":"32nd International Conference on Scientific and Statistical Database Management","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131729869","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
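The shared-execution idea for aggregates can be illustrated by maintaining several COUNT-based views in one scan of the input stream, instead of one plan per view (a toy sketch, not INGA's implementation; all names are hypothetical):

```python
from collections import defaultdict

def shared_aggregate(stream, views):
    """Maintain all materialized views with a single input scan.
    Each view is a (predicate, key_fn) pair sharing one COUNT
    aggregate; the stream is read exactly once for all views."""
    results = [defaultdict(int) for _ in views]
    for tup in stream:                          # single shared input scan
        for res, (pred, key_fn) in zip(results, views):
            if pred(tup):                       # shared select predicate stage
                res[key_fn(tup)] += 1           # per-view group/aggregate
    return results
```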