Proceedings of the 29th International Conference on Scientific and Statistical Database Management: Latest Papers

Multi-Hypothesis CSV Parsing
Till Döhmen, H. Mühleisen, P. Boncz
Abstract: Comma Separated Value (CSV) files are commonly used to represent data. CSV is a very simple format, yet we show that it gives rise to a surprisingly large number of ambiguities in its parsing and interpretation. We summarize the state of the art in CSV parsers, which typically make a linear series of parsing and interpretation decisions, such that any wrong decision at an earlier stage can negatively affect all downstream decisions. Since computation time is much less scarce than human time, we propose to turn CSV parsing into a ranking problem. Our quality-oriented multi-hypothesis CSV parsing approach generates several concurrent hypotheses about dialect, table structure, etc. and ranks these hypotheses based on quality features of the resulting table. This approach makes it possible to create an advanced CSV parser that makes many different decisions, yet keeps the overall parser code a simple plug-in infrastructure. The complex interactions between these decisions are taken care of by searching the hypothesis space rather than by having to program these many interactions in code. We show that our approach leads to better parsing results than the state of the art and facilitates the parsing of large corpora of heterogeneous CSV files.
DOI: https://doi.org/10.1145/3085504.3085520
Published: 2017-06-27
Citations: 23
Dynamic Group Trip Planning Queries in Spatial Databases
Anika Tabassum, Sukarna Barua, T. Hashem, Tasmin Chowdhury
Abstract: In this paper, we introduce the concept of "dynamic groups" for Group Trip Planning (GTP) queries and propose a novel query type, Dynamic Group Trip Planning (DGTP) queries. The traditional GTP query assumes that the group members remain static or fixed during the trip, whereas in the proposed DGTP queries, the group changes dynamically over the duration of a trip, where members can leave or join the group at any point of interest (POI) such as a shopping center, a restaurant or a movie theater. The changes of members in a group can be either predetermined (i.e., group changes are known before the trip is planned) or in real time (changes happen during the trip). In this paper, we provide efficient solutions for processing DGTP queries in the Euclidean space. A comprehensive experimental study using real and synthetic datasets shows that our efficient approach can compute DGTP query solutions within a few seconds and significantly outperforms a naive approach in terms of query processing time and I/O access.
DOI: https://doi.org/10.1145/3085504.3085584
Published: 2017-06-27
Citations: 16
A Unified Correlation-based Approach to Sampling Over Joins
N. Kamat, Arnab Nandi
Abstract: Supporting sampling in the presence of joins is an important problem in data analysis, but is inherently challenging due to the need to avoid correlation between output tuples. Current solutions provide either correlated or non-correlated samples. Sampling might not always be feasible in the non-correlated sampling-based approaches: the sample size or intermediate data size might be exceedingly large. On the other hand, a correlated sample may not be representative of the join. This paper presents a unified strategy towards join sampling, while considering sample correlation every step of the way. We provide two key contributions. First, in the case where a correlated sample is acceptable, we provide techniques, for all join types, to sample base relations so that their join is as random as possible. Second, in the case where a correlated sample is not acceptable, we provide enhancements to the state-of-the-art algorithms to reduce their execution time and intermediate data size.
DOI: https://doi.org/10.1145/3085504.3085524
Published: 2017-06-27
Citations: 8
DataSynthesizer: Privacy-Preserving Synthetic Datasets
Haoyue Ping, Julia Stoyanovich, Bill Howe
Abstract: To facilitate collaboration over sensitive data, we present DataSynthesizer, a tool that takes a sensitive dataset as input and generates a structurally and statistically similar synthetic dataset with strong privacy guarantees. The data owners need not release their data, while potential collaborators can begin developing models and methods with some confidence that their results will work similarly on the real dataset. The distinguishing feature of DataSynthesizer is its usability: the data owner does not have to specify any parameters to start generating and sharing data safely and effectively. DataSynthesizer consists of three high-level modules: DataDescriber, DataGenerator and ModelInspector. The first, DataDescriber, investigates the data types, correlations and distributions of the attributes in the private dataset, and produces a data summary, adding noise to the distributions to preserve privacy. DataGenerator samples from the summary computed by DataDescriber and outputs synthetic data. ModelInspector shows an intuitive description of the data summary that was computed by DataDescriber, allowing the data owner to evaluate the accuracy of the summarization process and adjust any parameters, if desired. We describe DataSynthesizer and illustrate its use in an urban science context, where sharing sensitive, legally encumbered data between agencies and with outside collaborators is reported as the primary obstacle to data-driven governance. The code implementing all parts of this work is publicly available at https://github.com/DataResponsibly/DataSynthesizer.
DOI: https://doi.org/10.1145/3085504.3091117
Published: 2017-06-27
Citations: 96
Query Suggestion to allow Intuitive Interactive Search in Multidimensional Time Series
Yifei Ding, Eamonn J. Keogh
Abstract: In recent years, the research community, inspired by its success in dealing with single-dimensional time series, has turned its attention to dealing with multidimensional time series. There is now a plethora of techniques for indexing, classification, and clustering of multidimensional time series. However, we argue that the difficulty of exploratory search in large multidimensional time series remains underappreciated. In essence, the problem reduces to the "chicken-and-egg" paradox that it is difficult to produce a meaningful query without knowing the best subset of dimensions to use, but finding the best subset of dimensions is itself query dependent. In this work we propose a solution to this problem. We introduce an algorithm that runs in the background, observing the user's search interactions. When appropriate, our algorithm suggests to the user a dimension that could be added or deleted to improve the user's satisfaction with the query. These query-dependent suggestions may be useful to the user, even if she does not act on them (by reissuing the query), as they can hint at unexpected relationships or redundancies between the dimensions of the data. We evaluate our algorithm on several real-world datasets in medical, human activity, and industrial domains, showing that it produces subjectively sensible and objectively superior results.
DOI: https://doi.org/10.1145/3085504.3085522
Published: 2017-06-27
Citations: 2
Challenges of Differentially Private Release of Data Under an Open-world Assumption
Elham Naghizade, J. Bailey, L. Kulik, E. Tanin
Abstract: Since its introduction a decade ago, differential privacy has been deployed and adapted in different application scenarios due to its rigorous protection of individuals' privacy regardless of the adversary's background knowledge. An urgent open research issue is how to query/release time-evolving datasets in a differentially private manner. Most of the proposed solutions in this area focus on releasing private counters or histograms, which involve low sensitivity, and the main focus of these solutions is minimizing the amount of noise and the utility loss throughout the process. In this paper we consider the case of releasing private numerical values with unbounded sensitivity in a dataset that grows over time. While providing utility bounds for such a case is of particular interest, we show that straightforward application of current mechanisms cannot guarantee (differential) privacy for individuals under an open-world assumption where data is continuously being updated, especially if the dataset is updated by an outlier.
DOI: https://doi.org/10.1145/3085504.3085531
Published: 2017-06-27
Citations: 3
BLOCK: Efficient Execution of Spatial Range Queries in Main-Memory
Matthaios Olma, F. Tauheed, T. Heinis, A. Ailamaki
Abstract: The execution of spatial range queries is at the core of many applications, particularly in the simulation sciences but also in many other domains. Although main memory in desktop and supercomputers alike has grown considerably in recent years, most spatial indexes supporting the efficient execution of range queries are still only optimized for disk access (minimizing disk page reads). Recent research has primarily focused on the optimization of known disk-based approaches for memory (through cache alignment etc.) but has not fundamentally revisited index structures for memory. In this paper we develop BLOCK, a novel approach to execute range queries on spatial data featuring volumetric objects in main memory. Our approach is built on the key insight that in-memory approaches need to be optimized to reduce the number of intersection tests (between objects and query but also in the index structure). Our experimental results show that BLOCK outperforms known in-memory indexes as well as in-memory implementations of disk-based spatial indexes by up to a factor of 7. The experiments show that it is more scalable than competing approaches as the data sets become denser.
DOI: https://doi.org/10.1145/3085504.3085519
Published: 2017-06-27
Citations: 13
On-line Versioned Schema Inference for Large Semantic Web Data Sources
Kenza Kellou-Menouer, Zoubida Kedad
Abstract: A growing number of data sources expressed in RDF(S)/OWL are available on the Web. They are increasingly used in big data and real-time applications. These data sources may be created without formally defining their schema, which is implicit in the stored data. The instances of a source do not have to conform to the schema when it is defined. This offers more flexibility and eases data evolution. However, it comes at the cost of losing the description of the data, which can be useful in many contexts. In this paper, we present SchemaDecrypt, a novel approach for discovering a versioned schema for a remote data source. SchemaDecrypt enables the discovery of the different structures of the existing classes. Our approach discovers the versions on-line, without uploading or browsing the data source. It makes it possible to overcome the source querying restrictions and the combinatorial explosion of the candidate versions. We present some experimental evaluations on DBpedia to demonstrate the performance of our approach.
DOI: https://doi.org/10.1145/3085504.3085513
Published: 2017-06-27
Citations: 6
VISOR: Visualizing Summaries of Ordered Data
Giovanni Mahlknecht, Michael H. Böhlen, Anton Dignös, J. Gamper
Abstract: In this paper, we present the VISOR tool, which helps the user to explore data and their summary structures by visualizing the relationships between the size k of a data summary and the induced error. Given an ordered dataset, VISOR allows the user to vary the size k of a data summary and to immediately see the effect on the induced error, by visualizing the error and its dependency on k in an ϵ-graph and Δ-graph, respectively. The user can easily explore different values of k and determine the best value for the summary size. VISOR also allows the user to compare different summarization methods, such as piecewise constant approximation, piecewise aggregation approximation or V-optimal histograms. We show several demonstration scenarios, including how to determine an appropriate value for the summary size and how to compare different summarization techniques.
DOI: https://doi.org/10.1145/3085504.3091115
Published: 2017-06-27
Citations: 6
Improving Statistical Similarity Based Data Reduction for Non-Stationary Data
Dongeun Lee, A. Sim, Jaesik Choi, Kesheng Wu
Abstract: We propose a new class of lossy compression based on a locally exchangeable measure that captures the distribution of repeating data blocks while preserving unique patterns. The technique has been demonstrated to reduce data volume by more than 100-fold on power grid monitoring data where a large number of data blocks can be characterized as following stationary probability distributions. To capture data with more diverse patterns, we propose two techniques to transform non-stationary time series into locally stationary blocks. We also propose a strategy to work with values in bounded ranges such as phase angles of alternating current. These new ideas are incorporated into a software package named IDEALEM. In experiments, IDEALEM reduces non-stationary data volume up to 100-fold. Compared with state-of-the-art lossy compression methods such as SZ, IDEALEM can produce more compact output overall.
DOI: https://doi.org/10.1145/3085504.3085583
Published: 2017-06-27
Citations: 4