Proceedings of the 27th International Conference on Scientific and Statistical Database Management: Latest Articles

Estimating mutual information on data streams
F. Keller, Emmanuel Müller, Klemens Böhm
{"title":"Estimating mutual information on data streams","authors":"F. Keller, Emmanuel Müller, Klemens Böhm","doi":"10.1145/2791347.2791348","DOIUrl":"https://doi.org/10.1145/2791347.2791348","url":null,"abstract":"Mutual information is a well-established and broadly used concept in information theory. It allows one to quantify the mutual dependence between two variables -- an essential task in data analysis. For static data, a broad range of techniques addresses the problem of estimating mutual information. However, the assumption of static data is not applicable for today's dynamic data sources such as data streams: In contrast to static approaches, an online estimator must be able to deal with the evolving, changing, and infinite nature of the stream. Furthermore, some tasks require the estimation to be available online while processing the raw data stream. Our proposed solution Mise (Mutual Information Stream Estimation) allows a user to issue mutual information queries in arbitrary time windows. As a key feature, we introduce a novel sampling scheme, which ensures an equal treatment of queries over multiple time scales, e.g., ranging from milliseconds up to decades. We thoroughly analyze the requirements of such a multiscale sampling scheme, and evaluate the resulting quality of Mise in a broad range of experiments.","PeriodicalId":225179,"journal":{"name":"Proceedings of the 27th International Conference on Scientific and Statistical Database Management","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129108660","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 29
Distributed top-k query processing on multi-dimensional data with keywords
Daichi Amagata, T. Hara, S. Nishio
{"title":"Distributed top-k query processing on multi-dimensional data with keywords","authors":"Daichi Amagata, T. Hara, S. Nishio","doi":"10.1145/2791347.2791355","DOIUrl":"https://doi.org/10.1145/2791347.2791355","url":null,"abstract":"As we are in the big data era, techniques for retrieving only user-desirable data objects from massive and diverse datasets are required. Ranking queries, e.g., top-k queries, which rank data objects based on a user-specified scoring function, make it possible to find such interesting data for users, and have received significant attention due to their wide range of applications. While many techniques for both centralized and distributed top-k query processing have been developed, they do not consider query keywords, i.e., simply retrieving k data with the best score. Utilizing keywords, on the other hand, is a common approach in data (and information) retrieval. Despite this fact, there is no study on retrieving top-k data containing all query keywords. We define, in this paper, a new query which enriches the conventional top-k queries, and propose some algorithms to solve the novel problem of how to efficiently retrieve k data objects with the best score and all query keywords from distributed databases. Extensive experiments on both real and synthetic data have demonstrated the efficiency and scalability of our algorithms in terms of communication cost and running time.","PeriodicalId":225179,"journal":{"name":"Proceedings of the 27th International Conference on Scientific and Statistical Database Management","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131382007","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 11
The hyperdyadic index and generalized indexing and query with PIQUE
David A. Boyuka, Houjun Tang, Kushal Bansal, Xiaocheng Zou, S. Klasky, N. Samatova
{"title":"The hyperdyadic index and generalized indexing and query with PIQUE","authors":"David A. Boyuka, Houjun Tang, Kushal Bansal, Xiaocheng Zou, S. Klasky, N. Samatova","doi":"10.1145/2791347.2791374","DOIUrl":"https://doi.org/10.1145/2791347.2791374","url":null,"abstract":"Many scientists rely on indexing and query to identify trends and anomalies within extreme-scale scientific data. Compressed bitmap indexing (e.g., FastBit) is the go-to indexing method for many scientific datasets and query workloads. Recently, the ALACRITY compressed inverted index was shown as a viable alternative approach. Notably, though FastBit and ALACRITY employ very different data structures (inverted list vs. bitmap) and binning methods (bit-wise vs. decimal-precision), close examination reveals marked similarities in index structure. Motivated by this observation, we ask two questions. First, \"Can we generalize FastBit and ALACRITY to an index model encompassing both?\" And second, if so, \"Can such a generalized framework enable other, new indexing methods?\" This paper answers both questions in the affirmative. First, we present PIQUE, a Parallel Indexing and Query Unified Engine, based on formal mathematical decomposition of the indexing process. PIQUE factors out commonalities in indexing, employing algorithmic/data structure \"plugins\" to mix orthogonal indexing concepts such as FastBit compressed bitmaps with ALACRITY binning, all within one framework. Second, we define the hyperdyadic tree index, distinct from both bitmap and inverted indexes, demonstrating good index compression while maintaining high query performance. We implement the hyperdyadic tree index within PIQUE, reinforcing our unified indexing model. We conduct a performance study of the hyperdyadic tree index vs. WAH compressed bitmaps, both within PIQUE and compared to FastBit, a state-of-the-art bitmap index system. The hyperdyadic tree index shows a 1.14-1.90x storage reduction vs. compressed bitmaps, with comparable or better query performance under most scenarios tested.","PeriodicalId":225179,"journal":{"name":"Proceedings of the 27th International Conference on Scientific and Statistical Database Management","volume":"171 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114613946","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
On the internal evaluation of unsupervised outlier detection
Henrique O. Marques, R. Campello, A. Zimek, J. Sander
{"title":"On the internal evaluation of unsupervised outlier detection","authors":"Henrique O. Marques, R. Campello, A. Zimek, J. Sander","doi":"10.1145/2791347.2791352","DOIUrl":"https://doi.org/10.1145/2791347.2791352","url":null,"abstract":"Although there is a large and growing literature that tackles the unsupervised outlier detection problem, the unsupervised evaluation of outlier detection results is still virtually untouched in the literature. The so-called internal evaluation, based solely on the data and the assessed solutions themselves, is required if one wants to statistically validate (in absolute terms) or just compare (in relative terms) the solutions provided by different algorithms or by different parameterizations of a given algorithm in the absence of labeled data. However, in contrast to unsupervised cluster analysis, where indexes for internal evaluation and validation of clustering solutions have been conceived and shown to be very useful, in the outlier detection domain this problem has been notably overlooked. Here we discuss this problem and provide a solution for the internal evaluation of top-n (binary) outlier detection results. Specifically, we propose an index called IREOS (Internal, Relative Evaluation of Outlier Solutions) that can evaluate and compare different candidate labelings of a collection of multivariate observations in terms of outliers and inliers. We also statistically adjust IREOS for chance and extensively evaluate it in several experiments involving different collections of synthetic and real data sets.","PeriodicalId":225179,"journal":{"name":"Proceedings of the 27th International Conference on Scientific and Statistical Database Management","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122231729","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 37
Querying RDF data with text annotated graphs
Lushan Han, Timothy W. Finin, A. Joshi, D. Cheng
{"title":"Querying RDF data with text annotated graphs","authors":"Lushan Han, Timothy W. Finin, A. Joshi, D. Cheng","doi":"10.1145/2791347.2791381","DOIUrl":"https://doi.org/10.1145/2791347.2791381","url":null,"abstract":"Scientists and casual users need better ways to query RDF databases or Linked Open Data. Using the SPARQL query language requires not only mastering its syntax and semantics but also understanding the RDF data model, the ontology used, and URIs for entities of interest. Natural language query systems are a powerful approach, but current techniques are brittle in addressing the ambiguity and complexity of natural language and require expensive labor to supply the extensive domain knowledge they need. We introduce a compromise in which users give a graphical \"skeleton\" for a query and annotate it with freely chosen words, phrases and entity names. We describe a framework for interpreting these \"schema-agnostic queries\" over open domain RDF data that automatically translates them to SPARQL queries. The framework uses semantic textual similarity to find mapping candidates and uses statistical approaches to learn domain knowledge for disambiguation, thus avoiding expensive human efforts required by natural language interface systems. We demonstrate the feasibility of the approach with an implementation that performs well in an evaluation on DBpedia data.","PeriodicalId":225179,"journal":{"name":"Proceedings of the 27th International Conference on Scientific and Statistical Database Management","volume":"519 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116703614","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
Improving performance of similarity measures for uncertain time series using preprocessing techniques
M. Orang, Nematollaah Shiri
{"title":"Improving performance of similarity measures for uncertain time series using preprocessing techniques","authors":"M. Orang, Nematollaah Shiri","doi":"10.1145/2791347.2791385","DOIUrl":"https://doi.org/10.1145/2791347.2791385","url":null,"abstract":"We study the impact of preprocessing techniques on performance and effectiveness of the similarity measures for uncertain time series. Some existing work on uncertain time series uses the same similarity measures developed for standard time series, to which we refer as traditional similarity measures. More recently, a number of new similarity measures have been proposed for uncertain time series, to which we refer as uncertain similarity measures. However, they have been shown not to be as effective as the traditional measures. In this work, we show that the performance of uncertain similarity measures can be improved through preprocessing techniques. We establish this through extensive experiments using the UCR benchmark data. Our results in fact indicate that the uncertain similarity measures together with preprocessing outperform the traditional similarity measures.","PeriodicalId":225179,"journal":{"name":"Proceedings of the 27th International Conference on Scientific and Statistical Database Management","volume":"9 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124665322","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 10
Efficient iterative processing in the SciDB parallel array engine
Emad Soroush, M. Balazinska, S. Krughoff, A. Connolly
{"title":"Efficient iterative processing in the SciDB parallel array engine","authors":"Emad Soroush, M. Balazinska, S. Krughoff, A. Connolly","doi":"10.1145/2791347.2791362","DOIUrl":"https://doi.org/10.1145/2791347.2791362","url":null,"abstract":"Many scientific data-intensive applications perform iterative computations on array data. There exist multiple engines specialized for array processing. These engines efficiently support various types of operations, but none includes native support for iterative processing. In this paper, we develop a model for iterative array computations and a series of optimizations. We evaluate the benefits of an optimized, native support for iterative array processing on the SciDB engine and real workloads from the astronomy domain.","PeriodicalId":225179,"journal":{"name":"Proceedings of the 27th International Conference on Scientific and Statistical Database Management","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131770454","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 23
GRAPHITE: an extensible graph traversal framework for relational database management systems
M. Paradies, Wolfgang Lehner, Christof Bornhövd
{"title":"GRAPHITE: an extensible graph traversal framework for relational database management systems","authors":"M. Paradies, Wolfgang Lehner, Christof Bornhövd","doi":"10.1145/2791347.2791383","DOIUrl":"https://doi.org/10.1145/2791347.2791383","url":null,"abstract":"Graph traversals are a basic but fundamental ingredient for a variety of graph algorithms and graph-oriented queries. To achieve the best possible query performance, they need to be implemented at the core of a database management system that aims at storing, manipulating, and querying graph data. Increasingly, modern business applications demand native graph query and processing capabilities for enterprise-critical operations on data stored in relational database management systems. In this paper we propose an extensible graph traversal framework (GRAPHITE) as a central graph processing component on a common storage engine inside a relational database management system. We study the influence of the graph topology on the execution time of graph traversals and derive two traversal algorithm implementations specialized for different graph topologies and traversal queries. We conduct extensive experiments on GRAPHITE for a large variety of real-world graph data sets and input configurations. Our experiments show that the proposed traversal algorithms differ by up to two orders of magnitude for different input configurations and therefore demonstrate the need for a versatile framework to efficiently process graph traversals on a wide range of different graph topologies and types of queries. Finally, we highlight that the query performance of our traversal implementations is competitive with that of two native graph database management systems.","PeriodicalId":225179,"journal":{"name":"Proceedings of the 27th International Conference on Scientific and Statistical Database Management","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-12-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131008642","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 36
RITA: an index-tuning advisor for replicated databases
Quoc Trung Tran, I. Jimenez, Rui Wang, N. Polyzotis, A. Ailamaki
{"title":"RITA: an index-tuning advisor for replicated databases","authors":"Quoc Trung Tran, I. Jimenez, Rui Wang, N. Polyzotis, A. Ailamaki","doi":"10.1145/2791347.2791376","DOIUrl":"https://doi.org/10.1145/2791347.2791376","url":null,"abstract":"Given a replicated database, a divergent design tunes the indexes in each replica differently in order to specialize it for a specific subset of the workload. Empirical studies have shown that this specialization brings significant performance gains compared to the common practice of having the same indexes in all replicas. However, reaping the benefits of divergent designs requires the development of new tuning tools for database administrators, and the existing tools unfortunately suffer from severe shortcomings: they assume a fixed number of replicas and a known workload distribution, and ignore the possibility of replica failures and the subsequent effect on load imbalance. To address these shortcomings, we analyze the theory and practice of tuning the divergent design of a replicated database. We design and implement RITA, a novel divergent-tuning advisor that offers several essential features not found in existing tools: (1) it generates robust divergent designs that allow the system to adapt gracefully to replica failures; (2) it computes designs that spread the load evenly among specialized replicas, both during normal operation and when replicas fail; (3) it monitors the workload online in order to detect changes that require a recomputation of the divergent design; and, (4) it offers suggestions to elastically reconfigure the system (by adding/removing replicas or adding/dropping indexes) to respond to workload changes. The key technical innovation in this paper is the formulation of the problem of selecting an optimal design as a Binary Integer Program (BIP). The BIP has a relatively small number of variables, thereby enabling an efficient solution using any off-the-shelf linear-optimization software. Experimental results demonstrate that RITA improves on the performance of the computed designs of existing tools by a factor of up to three, and at the same time has a low runtime overhead that enables fast tuning sessions.","PeriodicalId":225179,"journal":{"name":"Proceedings of the 27th International Conference on Scientific and Statistical Database Management","volume":"20 10","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120874374","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 14
Proceedings of the 27th International Conference on Scientific and Statistical Database Management
{"title":"Proceedings of the 27th International Conference on Scientific and Statistical Database Management","authors":"","doi":"10.1145/2791347","DOIUrl":"https://doi.org/10.1145/2791347","url":null,"abstract":"","PeriodicalId":225179,"journal":{"name":"Proceedings of the 27th International Conference on Scientific and Statistical Database Management","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122700369","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1