{"title":"Cohesive Subgraph Detection in Large Bipartite Networks","authors":"Y. Hao, Mengqi Zhang, Xiaoyang Wang, Chen Chen","doi":"10.1145/3400903.3400925","DOIUrl":"https://doi.org/10.1145/3400903.3400925","url":null,"abstract":"In real-world applications, bipartite graphs are widely used to model the relationships between two types of entities, such as customer-product relationships, gene co-expression, etc. As a fundamental problem, cohesive subgraph detection is of great importance for bipartite graph analysis. In this paper, we propose a novel cohesive subgraph model, named (α, β, ω)-core, which requires each node to have a sufficient number of close neighbors. The model emphasizes both the engagement of entities and the strength of connections. To scale to large networks, an efficient algorithm is developed to compute the (α, β, ω)-core. We conduct experiments over real-world bipartite graphs to verify the advantages of the proposed model and techniques compared with existing cohesive subgraph models.","PeriodicalId":334018,"journal":{"name":"32nd International Conference on Scientific and Statistical Database Management","volume":"67 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124153155","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
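The (α, β, ω)-core above builds on the classic (α, β)-core, in which every remaining node on the upper side keeps at least α neighbours and every node on the lower side at least β. As a rough illustration of the peeling idea only (the ω edge-strength constraint from the paper is omitted, and all names here are hypothetical), a minimal sketch:

```python
from collections import deque

def alpha_beta_core(upper_adj, lower_adj, alpha, beta):
    """Peel a bipartite graph down to its (alpha, beta)-core:
    every surviving upper node keeps >= alpha neighbours and
    every surviving lower node keeps >= beta neighbours."""
    removed_u = {u for u, nbrs in upper_adj.items() if len(nbrs) < alpha}
    removed_l = {v for v, nbrs in lower_adj.items() if len(nbrs) < beta}
    queue = deque([('u', u) for u in removed_u] + [('l', v) for v in removed_l])
    deg_u = {u: len(n) for u, n in upper_adj.items()}
    deg_l = {v: len(n) for v, n in lower_adj.items()}
    while queue:                      # cascade removals until stable
        side, node = queue.popleft()
        if side == 'u':
            for v in upper_adj[node]:
                if v not in removed_l:
                    deg_l[v] -= 1
                    if deg_l[v] < beta:
                        removed_l.add(v)
                        queue.append(('l', v))
        else:
            for u in lower_adj[node]:
                if u not in removed_u:
                    deg_u[u] -= 1
                    if deg_u[u] < alpha:
                        removed_u.add(u)
                        queue.append(('u', u))
    return set(upper_adj) - removed_u, set(lower_adj) - removed_l
```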
{"title":"Transit-based Task Assignment in Spatial Crowdsourcing","authors":"S. Gummidi, T. Pedersen, Xike Xie","doi":"10.1145/3400903.3400929","DOIUrl":"https://doi.org/10.1145/3400903.3400929","url":null,"abstract":"Worker movement information can help the spatial crowdsourcing platform to identify the right time to assign a task to a worker for successful completion of the task. However, the majority of current assignment strategies do not consider worker movement information. This paper aims to utilize worker movement information via transits in an online task assignment setting. The idea is to harness the waiting periods at different transit stops in a worker transit route (WTR) for performing tasks. Given the limited availability of workers’ waiting periods at transit stops, task deadlines, and workers’ preference for performing tasks with higher rewards, we define the Transit-based Task Assignment (TTA) problem. The objective of the TTA problem is to maximize the average worker reward, considering fixed worker transit models, in order to motivate workers. We solve the TTA problem by considering three variants, step by step, from offline to batch-based online versions. The first variant is the offline version of the TTA, which can be reduced to a maximum bipartite matching problem and leveraged for the second variant. The second variant is the batch-based online version of the TTA, for which we propose treating each batch as an offline TTA instance, along with additional credibility constraints to ensure a certain level of worker response quality. The third variant (Flexible-TTA) extends the batch-based online version of the TTA by relaxing the strict nature of the WTR model and assuming that a task with a higher reward than a worker-defined threshold value will convince the worker to stay longer at the transit stop. Through our extensive evaluation, we observe that the algorithm solving the Flexible-TTA problem outperforms the algorithms for the other TTA variants by 55% in terms of the number of assigned tasks and by at least 35% in terms of average worker reward. With respect to the baseline (online task assignment) algorithm, the algorithm solving the Flexible-TTA problem yields three times higher reward and runs at least three times faster.","PeriodicalId":334018,"journal":{"name":"32nd International Conference on Scientific and Statistical Database Management","volume":"316 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116236919","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
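The abstract notes that the offline TTA variant reduces to maximum bipartite matching between worker waiting slots and tasks. A minimal augmenting-path matching sketch (this is the textbook reduction target, not the paper's implementation; all names are hypothetical):

```python
def max_bipartite_matching(edges, workers):
    """Classic augmenting-path maximum matching. `edges` maps each
    worker slot to the tasks it can feasibly perform in that slot."""
    match_of_task = {}  # task -> worker currently assigned to it

    def try_assign(w, seen):
        for t in edges.get(w, []):
            if t in seen:
                continue
            seen.add(t)
            # Task is free, or its current worker can be re-routed.
            if t not in match_of_task or try_assign(match_of_task[t], seen):
                match_of_task[t] = w
                return True
        return False

    matched = sum(try_assign(w, set()) for w in workers)
    return matched, match_of_task
```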
{"title":"We Know What You Did Last Session: Policy-Based Query Classification for Data-Privacy Compliance With the DataEconomist","authors":"Peter K. Schwab, Maximilian S. Langohr, K. Meyer-Wegener","doi":"10.1145/3400903.3401692","DOIUrl":"https://doi.org/10.1145/3400903.3401692","url":null,"abstract":"This paper presents a demonstration of the DataEconomist, a framework for policy-based SQL query classification according to data-privacy directives. Our framework automatically derives query meta-information based on query-log analysis and provides user-friendly, graphical interfaces for browsing and filtering queries based on this meta-information. We aim to complement existing data-privacy approaches and enable privacy officers to define domain-specific compliance policy rules based on the graphical filter mechanisms. Policies automatically classify queries as compliant or non-compliant regarding their processing of personal data. During our demonstration, conference attendees can assess our system in several scenarios. They filter queries based on various query meta-information, learn how to define compliance policies for automatic query classification without profound technical knowledge, and test this classification by formulating non-compliant queries.","PeriodicalId":334018,"journal":{"name":"32nd International Conference on Scientific and Statistical Database Management","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129685976","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Arrays in Databases: from Underdog to First-Class Citizen","authors":"P. Baumann","doi":"10.1145/3400903.3409500","DOIUrl":"https://doi.org/10.1145/3400903.3409500","url":null,"abstract":"Array Databases close a gap in the database ecosystem by adding modelling, storage, and processing support for multi-dimensional arrays. Such structures have long been known in OLAP and statistics as “datacubes”, but they also appear as spatio-temporal sensor, image, simulation, and statistics data in all science and engineering domains. In our research we address Array Databases in all aspects, from concepts over architecture to applications. Our full-stack implementation rasdaman (\"raster data manager\"), which has effectively pioneered Array Databases, is in operational use on multi-Petabyte, federated Earth data assets. Based on this experience, the rasdaman team has initiated and shaped datacube standards such as ISO SQL/MDA (Multi-Dimensional Arrays) and the OGC Earth datacube standards suite. In our talk we present the concepts and implementation of rasdaman and show its application to Earth datacubes, illustrated by a live demonstration of operational services.","PeriodicalId":334018,"journal":{"name":"32nd International Conference on Scientific and Statistical Database Management","volume":"133 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117057177","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Top-k String Similarity Joins","authors":"Shuyao Qi, Panagiotis Bouros, N. Mamoulis","doi":"10.1145/3400903.3400922","DOIUrl":"https://doi.org/10.1145/3400903.3400922","url":null,"abstract":"Top-k joins have been extensively studied in relational databases as ranking operations where every object has, among others, at least one ranking attribute. However, the focus has mostly been on the case where the join attributes are of primitive data types (e.g., numerical values) and the join predicate is equality. In this work, we consider string objects assigned such ranking attributes or simply scores. Given two collections of string objects and a string similarity measure (e.g., the Edit distance), we introduce the top-k string similarity join, which returns the k sufficiently similar pairs of objects, with respect to a similarity threshold ϵ, that have the highest combined score computed by a monotone aggregate function γ (e.g., SUM). Such a join operation finds application in data integration, data cleaning and de-duplication scenarios, and in emerging scientific fields such as bioinformatics. We investigate how existing top-k join methods can be adapted and optimized for this operation, taking into account the semantics and the special characteristics of string similarity joins. We present techniques that avoid computing the entire string join, and indexing that enables pruning candidates with respect to both the string join and the ranking component of the query. Our extensive experimental analysis demonstrates the efficiency of our methodology by comparing solutions that either prioritize the ranking/join component or are able to handle both components of the query at the same time.","PeriodicalId":334018,"journal":{"name":"32nd International Conference on Scientific and Statistical Database Management","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123793083","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
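As a point of reference for the query semantics described above, a naive baseline materialises every pair, filters by edit-distance threshold ϵ, and ranks by the aggregate γ; the paper's contribution is precisely avoiding this full join via pruning, so this sketch only fixes the semantics (all names are hypothetical):

```python
from heapq import nlargest
from itertools import product

def edit_distance(a, b):
    """Standard dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[-1] + 1,            # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def topk_string_sim_join(R, S, eps, k, gamma=lambda x, y: x + y):
    """Naive baseline: keep pairs within edit distance eps, ranked
    by the monotone aggregate gamma (SUM by default). R and S are
    lists of (string, score) pairs."""
    candidates = (((r, s), gamma(score_r, score_s))
                  for (r, score_r), (s, score_s) in product(R, S)
                  if edit_distance(r, s) <= eps)
    return nlargest(k, candidates, key=lambda pair: pair[1])
```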
{"title":"Vectorising k-Core Decomposition for GPU Acceleration","authors":"Amir Mehrafsa, S. Chester, Alex Thomo","doi":"10.1145/3400903.3400931","DOIUrl":"https://doi.org/10.1145/3400903.3400931","url":null,"abstract":"k-Core decomposition is a well-studied community detection problem in graph analytics in which the k-core is the maximal subgraph where every vertex has degree at least k. The decomposition is expensive to compute on large graphs, and efforts to apply massive parallelism have had limited success. This paper presents a vectorisation of the problem that reframes it as a composition of vector primitives on flat, 1d arrays. With such a formulation, we can deploy highly optimised Deep Learning GPU and SIMD frameworks. On a moderate GPU, using PyTorch, we obtain up to 8× improvement over the best parallel state of the art, implemented in C++ and running on an expensive 32-core machine. More importantly, our approach represents a novel abstraction, showing that redesigning graph operations as a series of vectorised primitives makes highly parallel analytics both easier and more accessible for developers. We posit that such an approach can vastly accelerate the use of cheap GPU hardware in complex graph analytics.","PeriodicalId":334018,"journal":{"name":"32nd International Conference on Scientific and Statistical Database Management","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133107453","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
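For contrast with the vectorised formulation, the classic scalar peeling algorithm that the paper reframes as vector primitives can be sketched as follows (a plain reference implementation, not the paper's PyTorch code; names are hypothetical):

```python
def core_numbers(adj):
    """Peeling-based k-core decomposition: repeatedly remove a
    minimum-degree vertex; the running maximum of removal degrees
    is the core number assigned to each vertex."""
    deg = {v: len(nbrs) for v, nbrs in adj.items()}
    core, k = {}, 0
    remaining = set(adj)
    while remaining:
        v = min(remaining, key=deg.get)  # O(n) scan; fine for a sketch
        k = max(k, deg[v])
        core[v] = k
        remaining.remove(v)
        for u in adj[v]:                 # peel: lower neighbours' degrees
            if u in remaining:
                deg[u] -= 1
    return core
```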
{"title":"Orderings of Data - More Than a Tripping Hazard: Visionary","authors":"A. Beer, Valentin Hartmann, T. Seidl","doi":"10.1145/3400903.3400911","DOIUrl":"https://doi.org/10.1145/3400903.3400911","url":null,"abstract":"As data processing techniques get more and more sophisticated every day, many of us researchers often get lost in the details and subtleties of the algorithms we are developing and far too easily seem to forget to look also at the very first steps of every algorithm: the input of the data. Since there are plenty of library functions for this task, we indeed do not have to think about this part of the pipeline anymore. But maybe we should. All data is stored and loaded into a program in some order. In this vision paper we study how ignoring this order can (1) lead to performance issues and (2) make research results unreproducible. We furthermore examine desirable properties of a data ordering and why current approaches are often not suited to tackle the two mentioned problems.","PeriodicalId":334018,"journal":{"name":"32nd International Conference on Scientific and Statistical Database Management","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130531568","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Accessible Streaming Algorithms for the Chi-Square Test","authors":"Emily Farrow, Junbo Li, Farhan Zaki, Ashwin Lall","doi":"10.1145/3400903.3400905","DOIUrl":"https://doi.org/10.1145/3400903.3400905","url":null,"abstract":"We present space-efficient algorithms for performing Pearson’s chi-square goodness-of-fit test in a streaming setting. Since the chi-square test is one of the most well-known and commonly used tests in statistics, it is surprising that there has been no prior work on designing streaming algorithms for it. The test is not based on a specific distribution assumption and has one-sample and two-sample variants. Given a stream of data, the one-sample variant tests if the stream is drawn from a fixed distribution. The two-sample variant tests if two data streams are drawn from the same or similar distributions. One major advantage of using statistical tests over other quantities commonly measured by streaming algorithms is that these tests do not require parameter tuning and have results that can be easily interpreted by data analysts. The problem that we solve in this paper is how to compute the chi-square test on streams with minimal parameter configuration and assumptions. We give rigorous proofs showing that it is possible to compute the chi-square statistic with high fidelity and an almost quadratic reduction in memory in the continuous case, but the categorical case only admits heuristic solutions. We validate the performance and accuracy of our algorithms through extensive testing on both real and synthetic data sets.","PeriodicalId":334018,"journal":{"name":"32nd International Conference on Scientific and Statistical Database Management","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114833297","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
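For the one-sample variant over a categorical stream with known bins, the exact statistic can be maintained with one counter per bin; the paper's space-efficient algorithms aim below this naive baseline. A sketch of the exact counter-based computation (hypothetical names, not the paper's algorithm):

```python
class StreamingChiSquare:
    """One-sample chi-square over a categorical stream: one counter
    per bin, so memory is O(#bins) regardless of stream length."""

    def __init__(self, expected_probs):
        self.p = expected_probs                    # bin -> expected probability
        self.counts = {b: 0 for b in expected_probs}
        self.n = 0

    def update(self, item):
        """Process one stream element (must be a known bin)."""
        self.counts[item] += 1
        self.n += 1

    def statistic(self):
        """Sum over bins of (observed - expected)^2 / expected."""
        return sum((self.counts[b] - self.n * pb) ** 2 / (self.n * pb)
                   for b, pb in self.p.items())
```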
{"title":"DocDesign: Cost-Based Database Design for Document Stores","authors":"Moditha Hewasinghage, A. Abelló, Jovan Varga, E. Zimányi","doi":"10.1145/3400903.3401689","DOIUrl":"https://doi.org/10.1145/3400903.3401689","url":null,"abstract":"Document stores have become one of the most popular NoSQL systems, mainly due to their semi-structured data storage structure and well-developed query capabilities. The semi-structured nature allows them to have database designs beyond traditional normalization theories. This makes database design decisions more complicated, with a myriad of possibilities, and the database design process has thus resorted to ad-hoc trial-and-error methods. However, having a good database design is essential for any data storage system’s performance, and bad design decisions cannot always be compensated by adding more powerful hardware. In this work, we propose DocDesign, a decision aid tool for document store database design. DocDesign allows its users to evaluate different database designs for data storage requirements under a particular workload. Through DocDesign, users can make informed decisions about a design by evaluating estimated storage statistics and query runtimes without testing it on an actual document store. DocDesign also generates design-specific queries for the input workload. This not only cuts down the time and effort spent on design decision-making and development but also saves money spent on fixing poor designs in the long run. On-site, we will showcase how DocDesign facilitates the design decision-making process for MongoDB with both synthetic and real-world examples.","PeriodicalId":334018,"journal":{"name":"32nd International Conference on Scientific and Statistical Database Management","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125297731","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Shared Execution Techniques for Business Data Analytics over Big Data Streams","authors":"Serkan Uzunbaz, Walid G. Aref","doi":"10.1145/3400903.3400932","DOIUrl":"https://doi.org/10.1145/3400903.3400932","url":null,"abstract":"Business data analytics requires processing large numbers of data streams and creating materialized views in order to provide near-real-time answers to user queries. Materializing the view of each query and refreshing it continuously as a separate query execution plan is neither efficient nor scalable. In this paper, we present a global query execution plan to simultaneously support multiple queries and minimize the number of input scans, operators, and tuples flowing between the operators. We propose shared-execution techniques for creating and maintaining materialized views in support of business data analytics queries. We utilize commonalities across multiple business data analytics queries to support scalable and efficient processing of big data streams. The paper highlights shared-execution techniques for select predicates, grouping, and aggregate calculations. We present how global query execution plans are run in a distributed stream processing system, called INGA, which is built on top of Storm. In INGA, we are able to support online maintenance of 2500 materialized views for 237 queries by utilizing the shared constructs between the queries. We are able to run all 237 queries using a single global query execution plan tree with a depth of 21.","PeriodicalId":334018,"journal":{"name":"32nd International Conference on Scientific and Statistical Database Management","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131729869","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
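The shared-execution idea for aggregates can be illustrated by maintaining several COUNT-based views in one scan of the input stream, instead of one plan per view (a toy sketch, not INGA's implementation; all names are hypothetical):

```python
from collections import defaultdict

def shared_aggregate(stream, views):
    """Maintain all materialized views with a single input scan.
    Each view is a (predicate, key_fn) pair sharing one COUNT
    aggregate; the stream is read exactly once for all views."""
    results = [defaultdict(int) for _ in views]
    for tup in stream:                          # single shared input scan
        for res, (pred, key_fn) in zip(results, views):
            if pred(tup):                       # shared select predicate stage
                res[key_fn(tup)] += 1           # per-view group/aggregate
    return results
```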