{"title":"Multi-Hypothesis CSV Parsing","authors":"Till Döhmen, H. Mühleisen, P. Boncz","doi":"10.1145/3085504.3085520","DOIUrl":"https://doi.org/10.1145/3085504.3085520","url":null,"abstract":"Comma Separated Value (CSV) files are commonly used to represent data. CSV is a very simple format, yet we show that it gives rise to a surprisingly large amount of ambiguities in its parsing and interpretation. We summarize the state-of-the-art in CSV parsers, which typically make a linear series of parsing and interpretation decisions, such that any wrong decision at an earlier stage can negatively affect all downstream decisions. Since computation time is much less scarce than human time, we propose to turn CSV parsing into a ranking problem. Our quality-oriented multi-hypothesis CSV parsing approach generates several concurrent hypotheses about dialect, table structure, etc. and ranks these hypotheses based on quality features of the resulting table. This approach makes it possible to create an advanced CSV parser that makes many different decisions, yet keeps the overall parser code a simple plug-in infrastructure. The complex interactions between these decisions are taken care of by searching the hypothesis space rather than by having to program these many interactions in code. We show that our approach leads to better parsing results than the state of the art and facilitates the parsing of large corpora of heterogeneous CSV files.","PeriodicalId":431308,"journal":{"name":"Proceedings of the 29th International Conference on Scientific and Statistical Database Management","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125174911","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dynamic Group Trip Planning Queries in Spatial Databases","authors":"Anika Tabassum, Sukarna Barua, T. Hashem, Tasmin Chowdhury","doi":"10.1145/3085504.3085584","DOIUrl":"https://doi.org/10.1145/3085504.3085584","url":null,"abstract":"In this paper, we introduce the concept of \"dynamic groups\" for Group Trip Planning (GTP) queries and propose a novel query type, Dynamic Group Trip Planning (DGTP) queries. The traditional GTP query assumes that the group members remain static or fixed during the trip, whereas in the proposed DGTP queries, the group changes dynamically over the duration of a trip: members can leave or join the group at any point of interest (POI) such as a shopping center, a restaurant or a movie theater. The changes of members in a group can be either predetermined (i.e., group changes are known before the trip is planned) or in real-time (changes happen during the trip). In this paper, we provide efficient solutions for processing DGTP queries in the Euclidean space. A comprehensive experimental study using real and synthetic datasets shows that our efficient approach can compute DGTP query solutions within a few seconds and significantly outperforms a naive approach in terms of query processing time and I/O access.","PeriodicalId":431308,"journal":{"name":"Proceedings of the 29th International Conference on Scientific and Statistical Database Management","volume":"75 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125368500","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Unified Correlation-based Approach to Sampling Over Joins","authors":"N. Kamat, Arnab Nandi","doi":"10.1145/3085504.3085524","DOIUrl":"https://doi.org/10.1145/3085504.3085524","url":null,"abstract":"Supporting sampling in the presence of joins is an important problem in data analysis, but is inherently challenging due to the need to avoid correlation between output tuples. Current solutions provide either correlated or non-correlated samples. Sampling might not always be feasible in the non-correlated sampling-based approaches -- the sample size or intermediate data size might be exceedingly large. On the other hand, a correlated sample may not be representative of the join. This paper presents a unified strategy towards join sampling, while considering sample correlation every step of the way. We provide two key contributions. First, in the case where a correlated sample is acceptable, we provide techniques, for all join types, to sample base relations so that their join is as random as possible. Second, in the case where a correlated sample is not acceptable, we provide enhancements to the state-of-the-art algorithms to reduce their execution time and intermediate data size.","PeriodicalId":431308,"journal":{"name":"Proceedings of the 29th International Conference on Scientific and Statistical Database Management","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127005509","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DataSynthesizer: Privacy-Preserving Synthetic Datasets","authors":"Haoyue Ping, Julia Stoyanovich, Bill Howe","doi":"10.1145/3085504.3091117","DOIUrl":"https://doi.org/10.1145/3085504.3091117","url":null,"abstract":"To facilitate collaboration over sensitive data, we present DataSynthesizer, a tool that takes a sensitive dataset as input and generates a structurally and statistically similar synthetic dataset with strong privacy guarantees. The data owners need not release their data, while potential collaborators can begin developing models and methods with some confidence that their results will work similarly on the real dataset. The distinguishing feature of DataSynthesizer is its usability --- the data owner does not have to specify any parameters to start generating and sharing data safely and effectively. DataSynthesizer consists of three high-level modules --- DataDescriber, DataGenerator and ModelInspector. The first, DataDescriber, investigates the data types, correlations and distributions of the attributes in the private dataset, and produces a data summary, adding noise to the distributions to preserve privacy. DataGenerator samples from the summary computed by DataDescriber and outputs synthetic data. ModelInspector shows an intuitive description of the data summary that was computed by DataDescriber, allowing the data owner to evaluate the accuracy of the summarization process and adjust any parameters, if desired. We describe DataSynthesizer and illustrate its use in an urban science context, where sharing sensitive, legally encumbered data between agencies and with outside collaborators is reported as the primary obstacle to data-driven governance. The code implementing all parts of this work is publicly available at https://github.com/DataResponsibly/DataSynthesizer.","PeriodicalId":431308,"journal":{"name":"Proceedings of the 29th International Conference on Scientific and Statistical Database Management","volume":"128 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123467243","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Query Suggestion to allow Intuitive Interactive Search in Multidimensional Time Series","authors":"Yifei Ding, Eamonn J. Keogh","doi":"10.1145/3085504.3085522","DOIUrl":"https://doi.org/10.1145/3085504.3085522","url":null,"abstract":"In recent years, the research community, inspired by its success in dealing with single-dimensional time series, has turned its attention to dealing with multidimensional time series. There are now a plethora of techniques for indexing, classification, and clustering of multidimensional time series. However, we argue that the difficulty of exploratory search in large multidimensional time series remains underappreciated. In essence, the problem reduces to the \"chicken-and-egg\" paradox that it is difficult to produce a meaningful query without knowing the best subset of dimensions to use, but finding the best subset of dimensions is itself query dependent. In this work we propose a solution to this problem. We introduce an algorithm that runs in the background, observing the user's search interactions. When appropriate, our algorithm suggests to the user a dimension that could be added or deleted to improve the user's satisfaction with the query. These query dependent suggestions may be useful to the user, even if she does not act on them (by reissuing the query), as they can hint at unexpected relationships or redundancies between the dimensions of the data. We evaluate our algorithm on several real-world datasets in medical, human activity, and industrial domains, showing that it produces subjectively sensible and objectively superior results.","PeriodicalId":431308,"journal":{"name":"Proceedings of the 29th International Conference on Scientific and Statistical Database Management","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127752741","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Challenges of Differentially Private Release of Data Under an Open-world Assumption","authors":"Elham Naghizade, J. Bailey, L. Kulik, E. Tanin","doi":"10.1145/3085504.3085531","DOIUrl":"https://doi.org/10.1145/3085504.3085531","url":null,"abstract":"Since its introduction a decade ago, differential privacy has been deployed and adapted in different application scenarios due to its rigorous protection of individuals' privacy regardless of the adversary's background knowledge. An urgent open research issue is how to query/release time evolving datasets in a differentially private manner. Most of the proposed solutions in this area focus on releasing private counters or histograms, which involve low sensitivity, and the main focus of these solutions is minimizing the amount of noise and the utility loss throughout the process. In this paper we consider the case of releasing private numerical values with unbounded sensitivity in a dataset that grows over time. While providing utility bounds for such case is of particular interest, we show that straightforward application of current mechanisms cannot guarantee (differential) privacy for individuals under an open-world assumption where data is continuously being updated, especially if the dataset is updated by an outlier.","PeriodicalId":431308,"journal":{"name":"Proceedings of the 29th International Conference on Scientific and Statistical Database Management","volume":"148 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114322140","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"BLOCK: Efficient Execution of Spatial Range Queries in Main-Memory","authors":"Matthaios Olma, F. Tauheed, T. Heinis, A. Ailamaki","doi":"10.1145/3085504.3085519","DOIUrl":"https://doi.org/10.1145/3085504.3085519","url":null,"abstract":"The execution of spatial range queries is at the core of many applications, particularly in the simulation sciences but also in many other domains. Although main memory in desktop and supercomputers alike has grown considerably in recent years, most spatial indexes supporting the efficient execution of range queries are still only optimized for disk access (minimizing disk page reads). Recent research has primarily focused on the optimization of known disk-based approaches for memory (through cache alignment etc.) but has not fundamentally revisited index structures for memory. In this paper we develop BLOCK, a novel approach to execute range queries on spatial data featuring volumetric objects in main memory. Our approach is built on the key insight that in-memory approaches need to be optimized to reduce the number of intersection tests (between objects and query but also in the index structure). Our experimental results show that BLOCK outperforms known in-memory indexes as well as in-memory implementations of disk-based spatial indexes by up to a factor of 7. The experiments show that it is more scalable than competing approaches as the data sets become denser.","PeriodicalId":431308,"journal":{"name":"Proceedings of the 29th International Conference on Scientific and Statistical Database Management","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125800285","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On-line Versioned Schema Inference for Large Semantic Web Data Sources","authors":"Kenza Kellou-Menouer, Zoubida Kedad","doi":"10.1145/3085504.3085513","DOIUrl":"https://doi.org/10.1145/3085504.3085513","url":null,"abstract":"A growing number of data sources expressed in RDF(S)/OWL are available on the Web. They are increasingly used in big data and real-time applications. These data sources may be created without formally defining their schema, which is implicit in the stored data. The instances of a source do not have to conform to the schema when it is defined. This offers more flexibility and eases data evolution. However, it comes at the cost of losing the description of the data, which can be useful in many contexts. In this paper, we present SchemaDecrypt, a novel approach for discovering a versioned schema for a remote data source. SchemaDecrypt enables the discovery of the different structures of the existing classes. Our approach discovers the versions on-line, without uploading or browsing the data source, which makes it possible to overcome source querying restrictions and the combinatorial explosion of candidate versions. We present experimental evaluations on DBpedia to demonstrate the performance of our approach.","PeriodicalId":431308,"journal":{"name":"Proceedings of the 29th International Conference on Scientific and Statistical Database Management","volume":"59 3","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114045889","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"VISOR: Visualizing Summaries of Ordered Data","authors":"Giovanni Mahlknecht, Michael H. Böhlen, Anton Dignös, J. Gamper","doi":"10.1145/3085504.3091115","DOIUrl":"https://doi.org/10.1145/3085504.3091115","url":null,"abstract":"In this paper, we present the VISOR tool, which helps the user to explore data and their summary structures by visualizing the relationships between the size k of a data summary and the induced error. Given an ordered dataset, VISOR allows the user to vary the size k of a data summary and to immediately see the effect on the induced error, by visualizing the error and its dependency on k in an ϵ-graph and Δ-graph, respectively. The user can easily explore different values of k and determine the best value for the summary size. VISOR also allows the user to compare different summarization methods, such as piecewise constant approximation, piecewise aggregate approximation or V-optimal histograms. We show several demonstration scenarios, including how to determine an appropriate value for the summary size and how to compare different summarization techniques.","PeriodicalId":431308,"journal":{"name":"Proceedings of the 29th International Conference on Scientific and Statistical Database Management","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134410831","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving Statistical Similarity Based Data Reduction for Non-Stationary Data","authors":"Dongeun Lee, A. Sim, Jaesik Choi, Kesheng Wu","doi":"10.1145/3085504.3085583","DOIUrl":"https://doi.org/10.1145/3085504.3085583","url":null,"abstract":"We propose a new class of lossy compression based on locally exchangeable measure that captures the distribution of repeating data blocks while preserving unique patterns. The technique has been demonstrated to reduce data volume by more than 100-fold on power grid monitoring data where a large number of data blocks can be characterized as following stationary probability distributions. To capture data with more diverse patterns, we propose two techniques to transform non-stationary time series into locally stationary blocks. We also propose a strategy to work with values in bounded ranges such as phase angles of alternating current. These new ideas are incorporated into a software package named IDEALEM. In experiments, IDEALEM reduces non-stationary data volume up to 100-fold. Compared with the state-of-the-art lossy compression methods such as SZ, IDEALEM can produce more compact output overall.","PeriodicalId":431308,"journal":{"name":"Proceedings of the 29th International Conference on Scientific and Statistical Database Management","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122065070","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}