{"title":"Multi-Hypothesis CSV Parsing","authors":"Till Döhmen, H. Mühleisen, P. Boncz","doi":"10.1145/3085504.3085520","DOIUrl":"https://doi.org/10.1145/3085504.3085520","url":null,"abstract":"Comma Separated Value (CSV) files are commonly used to represent data. CSV is a very simple format, yet we show that it gives rise to a surprisingly large amount of ambiguities in its parsing and interpretation. We summarize the state-of-the-art in CSV parsers, which typically make a linear series of parsing and interpretation decisions, such that any wrong decision at an earlier stage can negatively affect all downstream decisions. Since computation time is much less scarce than human time, we propose to turn CSV parsing into a ranking problem. Our quality-oriented multi-hypothesis CSV parsing approach generates several concurrent hypotheses about dialect, table structure, etc. and ranks these hypotheses based on quality features of the resulting table. This approach makes it possible to create an advanced CSV parser that makes many different decisions, yet keeps the overall parser code a simple plug-in infrastructure. The complex interactions between these decisions are taken care of by searching the hypothesis space rather than by having to program these many interactions in code. We show that our approach leads to better parsing results than the state of the art and facilitates the parsing of large corpora of heterogeneous CSV files.","PeriodicalId":431308,"journal":{"name":"Proceedings of the 29th International Conference on Scientific and Statistical Database Management","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125174911","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dynamic Group Trip Planning Queries in Spatial Databases","authors":"Anika Tabassum, Sukarna Barua, T. Hashem, Tasmin Chowdhury","doi":"10.1145/3085504.3085584","DOIUrl":"https://doi.org/10.1145/3085504.3085584","url":null,"abstract":"In this paper, we introduce the concept of \"dynamic groups\" for Group Trip Planning (GTP) queries and propose a novel query type, Dynamic Group Trip Planning (DGTP) queries. The traditional GTP query assumes that the group members remain static or fixed during the trip, whereas in the proposed DGTP queries, the group changes dynamically over the duration of a trip: members can leave or join the group at any point of interest (POI) such as a shopping center, a restaurant or a movie theater. The changes of members in a group can be either predetermined (i.e., group changes are known before the trip is planned) or in real-time (changes happen during the trip). In this paper, we provide efficient solutions for processing DGTP queries in the Euclidean space. A comprehensive experimental study using real and synthetic datasets shows that our efficient approach can compute DGTP query solutions within a few seconds and significantly outperforms a naive approach in terms of query processing time and I/O access.","PeriodicalId":431308,"journal":{"name":"Proceedings of the 29th International Conference on Scientific and Statistical Database Management","volume":"75 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125368500","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Unified Correlation-based Approach to Sampling Over Joins","authors":"N. Kamat, Arnab Nandi","doi":"10.1145/3085504.3085524","DOIUrl":"https://doi.org/10.1145/3085504.3085524","url":null,"abstract":"Supporting sampling in the presence of joins is an important problem in data analysis, but is inherently challenging due to the need to avoid correlation between output tuples. Current solutions provide either correlated or non-correlated samples. Sampling might not always be feasible in the non-correlated sampling-based approaches -- the sample size or intermediate data size might be exceedingly large. On the other hand, a correlated sample may not be representative of the join. This paper presents a unified strategy towards join sampling, while considering sample correlation every step of the way. We provide two key contributions. First, in the case where a correlated sample is acceptable, we provide techniques, for all join types, to sample base relations so that their join is as random as possible. Second, in the case where a correlated sample is not acceptable, we provide enhancements to the state-of-the-art algorithms to reduce their execution time and intermediate data size.","PeriodicalId":431308,"journal":{"name":"Proceedings of the 29th International Conference on Scientific and Statistical Database Management","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127005509","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DataSynthesizer: Privacy-Preserving Synthetic Datasets","authors":"Haoyue Ping, Julia Stoyanovich, Bill Howe","doi":"10.1145/3085504.3091117","DOIUrl":"https://doi.org/10.1145/3085504.3091117","url":null,"abstract":"To facilitate collaboration over sensitive data, we present DataSynthesizer, a tool that takes a sensitive dataset as input and generates a structurally and statistically similar synthetic dataset with strong privacy guarantees. The data owners need not release their data, while potential collaborators can begin developing models and methods with some confidence that their results will work similarly on the real dataset. The distinguishing feature of DataSynthesizer is its usability --- the data owner does not have to specify any parameters to start generating and sharing data safely and effectively. DataSynthesizer consists of three high-level modules --- DataDescriber, DataGenerator and ModelInspector. The first, DataDescriber, investigates the data types, correlations and distributions of the attributes in the private dataset, and produces a data summary, adding noise to the distributions to preserve privacy. DataGenerator samples from the summary computed by DataDescriber and outputs synthetic data. ModelInspector shows an intuitive description of the data summary that was computed by DataDescriber, allowing the data owner to evaluate the accuracy of the summarization process and adjust any parameters, if desired. We describe DataSynthesizer and illustrate its use in an urban science context, where sharing sensitive, legally encumbered data between agencies and with outside collaborators is reported as the primary obstacle to data-driven governance. The code implementing all parts of this work is publicly available at https://github.com/DataResponsibly/DataSynthesizer.","PeriodicalId":431308,"journal":{"name":"Proceedings of the 29th International Conference on Scientific and Statistical Database Management","volume":"128 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123467243","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Query Suggestion to allow Intuitive Interactive Search in Multidimensional Time Series","authors":"Yifei Ding, Eamonn J. Keogh","doi":"10.1145/3085504.3085522","DOIUrl":"https://doi.org/10.1145/3085504.3085522","url":null,"abstract":"In recent years, the research community, inspired by its success in dealing with single-dimensional time series, has turned its attention to dealing with multidimensional time series. There are now a plethora of techniques for indexing, classification, and clustering of multidimensional time series. However, we argue that the difficulty of exploratory search in large multidimensional time series remains underappreciated. In essence, the problem reduces to the \"chicken-and-egg\" paradox that it is difficult to produce a meaningful query without knowing the best subset of dimensions to use, but finding the best subset of dimensions is itself query dependent. In this work we propose a solution to this problem. We introduce an algorithm that runs in the background, observing the user's search interactions. When appropriate, our algorithm suggests to the user a dimension that could be added or deleted to improve the user's satisfaction with the query. These query dependent suggestions may be useful to the user, even if she does not act on them (by reissuing the query), as they can hint at unexpected relationships or redundancies between the dimensions of the data. We evaluate our algorithm on several real-world datasets in medical, human activity, and industrial domains, showing that it produces subjectively sensible and objectively superior results.","PeriodicalId":431308,"journal":{"name":"Proceedings of the 29th International Conference on Scientific and Statistical Database Management","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127752741","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Challenges of Differentially Private Release of Data Under an Open-world Assumption","authors":"Elham Naghizade, J. Bailey, L. Kulik, E. Tanin","doi":"10.1145/3085504.3085531","DOIUrl":"https://doi.org/10.1145/3085504.3085531","url":null,"abstract":"Since its introduction a decade ago, differential privacy has been deployed and adapted in different application scenarios due to its rigorous protection of individuals' privacy regardless of the adversary's background knowledge. An urgent open research issue is how to query/release time evolving datasets in a differentially private manner. Most of the proposed solutions in this area focus on releasing private counters or histograms, which involve low sensitivity, and the main focus of these solutions is minimizing the amount of noise and the utility loss throughout the process. In this paper we consider the case of releasing private numerical values with unbounded sensitivity in a dataset that grows over time. While providing utility bounds for such case is of particular interest, we show that straightforward application of current mechanisms cannot guarantee (differential) privacy for individuals under an open-world assumption where data is continuously being updated, especially if the dataset is updated by an outlier.","PeriodicalId":431308,"journal":{"name":"Proceedings of the 29th International Conference on Scientific and Statistical Database Management","volume":"148 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114322140","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"BLOCK: Efficient Execution of Spatial Range Queries in Main-Memory","authors":"Matthaios Olma, F. Tauheed, T. Heinis, A. Ailamaki","doi":"10.1145/3085504.3085519","DOIUrl":"https://doi.org/10.1145/3085504.3085519","url":null,"abstract":"The execution of spatial range queries is at the core of many applications, particularly in the simulation sciences but also in many other domains. Although main memory in desktop and supercomputers alike has grown considerably in recent years, most spatial indexes supporting the efficient execution of range queries are still only optimized for disk access (minimizing disk page reads). Recent research has primarily focused on the optimization of known disk-based approaches for memory (through cache alignment etc.) but has not fundamentally revisited index structures for memory. In this paper we develop BLOCK, a novel approach to execute range queries on spatial data featuring volumetric objects in main memory. Our approach is built on the key insight that in-memory approaches need to be optimized to reduce the number of intersection tests (between objects and query but also in the index structure). Our experimental results show that BLOCK outperforms known in-memory indexes as well as in-memory implementations of disk-based spatial indexes by up to a factor of 7. The experiments show that it is more scalable than competing approaches as the data sets become denser.","PeriodicalId":431308,"journal":{"name":"Proceedings of the 29th International Conference on Scientific and Statistical Database Management","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125800285","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On-line Versioned Schema Inference for Large Semantic Web Data Sources","authors":"Kenza Kellou-Menouer, Zoubida Kedad","doi":"10.1145/3085504.3085513","DOIUrl":"https://doi.org/10.1145/3085504.3085513","url":null,"abstract":"A growing number of data sources expressed in RDF(S)/OWL are available on the Web. They are increasingly used in big data and real-time applications. These data sources may be created without formally defining their schema, which is implicit in the stored data. The instances of a source do not have to conform to the schema when it is defined. This offers more flexibility and eases data evolution. However, it comes at the cost of losing the description of the data, which can be useful in many contexts. In this paper, we present SchemaDecrypt, a novel approach for discovering a versioned schema for a remote data source. SchemaDecrypt enables the discovery of the different structures of the existing classes. Our approach discovers the versions on-line, without uploading or browsing the data source, which makes it possible to overcome source querying restrictions and the combinatorial explosion of candidate versions. We present experimental evaluations on DBpedia to demonstrate the performance of our approach.","PeriodicalId":431308,"journal":{"name":"Proceedings of the 29th International Conference on Scientific and Statistical Database Management","volume":"59 3","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114045889","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"VISOR: Visualizing Summaries of Ordered Data","authors":"Giovanni Mahlknecht, Michael H. Böhlen, Anton Dignös, J. Gamper","doi":"10.1145/3085504.3091115","DOIUrl":"https://doi.org/10.1145/3085504.3091115","url":null,"abstract":"In this paper, we present the VISOR tool, which helps the user to explore data and their summary structures by visualizing the relationships between the size k of a data summary and the induced error. Given an ordered dataset, VISOR allows the user to vary the size k of a data summary and to immediately see the effect on the induced error, by visualizing the error and its dependency on k in an ϵ-graph and Δ-graph, respectively. The user can easily explore different values of k and determine the best value for the summary size. VISOR also allows the user to compare different summarization methods, such as piecewise constant approximation, piecewise aggregate approximation or V-optimal histograms. We show several demonstration scenarios, including how to determine an appropriate value for the summary size and how to compare different summarization techniques.","PeriodicalId":431308,"journal":{"name":"Proceedings of the 29th International Conference on Scientific and Statistical Database Management","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134410831","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving Statistical Similarity Based Data Reduction for Non-Stationary Data","authors":"Dongeun Lee, A. Sim, Jaesik Choi, Kesheng Wu","doi":"10.1145/3085504.3085583","DOIUrl":"https://doi.org/10.1145/3085504.3085583","url":null,"abstract":"We propose a new class of lossy compression based on locally exchangeable measure that captures the distribution of repeating data blocks while preserving unique patterns. The technique has been demonstrated to reduce data volume by more than 100-fold on power grid monitoring data where a large number of data blocks can be characterized as following stationary probability distributions. To capture data with more diverse patterns, we propose two techniques to transform non-stationary time series into locally stationary blocks. We also propose a strategy to work with values in bounded ranges such as phase angles of alternating current. These new ideas are incorporated into a software package named IDEALEM. In experiments, IDEALEM reduces non-stationary data volume up to 100-fold. Compared with the state-of-the-art lossy compression methods such as SZ, IDEALEM can produce more compact output overall.","PeriodicalId":431308,"journal":{"name":"Proceedings of the 29th International Conference on Scientific and Statistical Database Management","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122065070","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}