{"title":"pcApriori: scalable apriori for multiprocessor systems","authors":"B. Schlegel, Tim Kiefer, T. Kissinger, Wolfgang Lehner","doi":"10.1145/2484838.2484879","DOIUrl":"https://doi.org/10.1145/2484838.2484879","url":null,"abstract":"Frequent-itemset mining is an important part of data mining. It is a computationally and memory-intensive task with a large number of scientific and statistical application areas. In many of them, the datasets can easily grow to tens or even several hundred gigabytes of data. Hence, efficient algorithms are required to process such amounts of data. In recent years, many efficient sequential mining algorithms have been proposed, which, however, cannot exploit current and future systems providing large degrees of parallelism. In contrast, the number of parallel frequent-itemset mining algorithms is rather small, and most of them do not scale well as the number of threads increases. In this paper, we present a highly scalable mining algorithm that is based on the well-known Apriori algorithm; it is optimized for processing very large datasets on multiprocessor systems. The key idea of pcApriori is to employ a modified producer--consumer processing scheme, which partitions the data during processing and distributes it to the available threads. We conduct many experiments on large datasets. pcApriori scales almost linearly on our test system comprising 32 cores.","PeriodicalId":269347,"journal":{"name":"Proceedings of the 25th International Conference on Scientific and Statistical Database Management","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125801544","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
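The level-wise generate-and-count core that pcApriori builds on can be sketched as follows. This is a minimal sequential Apriori baseline for illustration only, not the paper's parallel producer--consumer scheme; transaction data and the absolute support threshold are assumptions of the sketch.

```python
from itertools import combinations
from collections import Counter

def apriori(transactions, minsup):
    """Classic Apriori: level-wise mining of frequent itemsets.
    `transactions` is a list of sets; `minsup` is an absolute support count."""
    # Level 1: frequent single items
    counts = Counter(item for t in transactions for item in t)
    frequent = {frozenset([i]) for i, c in counts.items() if c >= minsup}
    result = set(frequent)
    k = 2
    while frequent:
        # Candidate generation: k-subsets whose (k-1)-subsets are all frequent
        items = sorted({i for s in frequent for i in s})
        candidates = {frozenset(c) for c in combinations(items, k)
                      if all(frozenset(sub) in frequent
                             for sub in combinations(c, k - 1))}
        # Support counting over all transactions
        counts = Counter()
        for t in transactions:
            for cand in candidates:
                if cand <= t:
                    counts[cand] += 1
        frequent = {c for c in candidates if counts[c] >= minsup}
        result |= frequent
        k += 1
    return result
```

pcApriori's contribution lies in how the counting phase is partitioned and distributed across threads; the anti-monotonicity pruning above is the part both variants share.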
{"title":"Multi-scale dissemination of time series data","authors":"Qingsong Guo, Yongluan Zhou, Li Su","doi":"10.1145/2484838.2484878","DOIUrl":"https://doi.org/10.1145/2484838.2484878","url":null,"abstract":"In this paper, we consider the problem of continuous dissemination of time series data, such as sensor measurements, to a large number of subscribers. These subscribers fall into multiple subscription levels, where each subscription level is specified by the bandwidth constraint of a subscriber, which is an abstract indicator of both the physical limits and the amount of data that the subscriber would like to handle. To handle this problem, we propose a system framework for multi-scale time series data dissemination that employs a typical tree-based dissemination network and existing time-series compression models. Due to the bandwidth limits and the potentially high speed of the data, it is inevitable that data is compressed and re-compressed along the dissemination paths according to the subscription level of each node. Compression causes a loss of accuracy, so we devise several algorithms to optimize the average accuracy of the data received by all subscribers within the dissemination network. Finally, we have conducted extensive experiments to study the performance of the algorithms.","PeriodicalId":269347,"journal":{"name":"Proceedings of the 25th International Conference on Scientific and Statistical Database Management","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128846006","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
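The abstract does not name the compression models used, but the compress/re-compress step along a dissemination path can be illustrated with Piecewise Aggregate Approximation (PAA), a standard time-series compression model; the `budget` parameter here is a hypothetical stand-in for a node's subscription-level bandwidth constraint.

```python
def paa_compress(series, budget):
    """Piecewise Aggregate Approximation: reduce `series` to at most
    `budget` segment means. Calling this again on an already compressed
    series with a smaller budget models re-compression further down a
    dissemination path."""
    n = len(series)
    budget = min(budget, n)
    out = []
    for s in range(budget):
        # Near-equal-width segments covering the whole series
        lo = s * n // budget
        hi = (s + 1) * n // budget
        out.append(sum(series[lo:hi]) / (hi - lo))
    return out
```

Each re-compression is lossy, which is exactly why the paper optimizes the average accuracy received across all subscribers rather than treating each hop independently.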
{"title":"Graywulf: a platform for federated scientific databases and services","authors":"L. Dobos, I. Csabai, A. Szalay, T. Budavári, Nolan Li","doi":"10.1145/2484838.2484863","DOIUrl":"https://doi.org/10.1145/2484838.2484863","url":null,"abstract":"Many fields of science rely on relational database management systems to analyze, publish and share data. Since RDBMS are originally designed for, and their development directions are primarily driven by, business use cases, they often lack features that are very important for scientific applications. Horizontal scalability is probably the most important missing feature, which makes it challenging to adapt traditional relational database systems to ever-growing data sizes. Due to the limited support of array data types and metadata management, successful application of RDBMS in science usually requires the development of custom extensions. While some of these extensions are specific to a particular field of science, the majority of them could easily be generalized and reused in other disciplines. With the Graywulf project we target several goals. We are building a generic platform that offers reusable components for efficient storage, transformation, statistical analysis and presentation of scientific data stored in Microsoft SQL Server. Graywulf also addresses the distributed computational issues arising from current RDBMS technologies. The current version supports load balancing of simple queries and parallel execution of partitioned queries over a set of mirrored databases. Uniform user access to the data is provided through a web-based query interface and a data surface for software clients. Queries are formulated in a slightly modified syntax of SQL that offers a transparent view of the distributed data. The software library consists of several components that can be reused to develop complex scientific data warehouses: a system registry, administration tools to manage entire database server clusters, a sophisticated workflow execution framework, and a SQL parser library.","PeriodicalId":269347,"journal":{"name":"Proceedings of the 25th International Conference on Scientific and Statistical Database Management","volume":"230 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122953006","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Accelerating gene context analysis using bitmaps","authors":"A. Romosan, A. Shoshani, Kesheng Wu, V. Markowitz, K. Mavrommatis","doi":"10.1145/2484838.2484856","DOIUrl":"https://doi.org/10.1145/2484838.2484856","url":null,"abstract":"Gene context analysis determines the function of genes by examining the conservation of chromosomal gene clusters and co-occurrence functional profiles across genomes. This is based on the observation that functionally related genes are often collocated on chromosomes as part of so-called \"gene cassettes\", and relies on the identification of such cassettes across a statistically significant and phylogenetically diverse collection of genomes. Gene context analysis is an important part of a genomic data management system such as the Integrated Microbial Genomes (IMG) system, which has one of the largest public genome collections. As of January 2013, IMG contains 3.3 million gene cassettes across 8,000 genomes. A gene context analysis in IMG performs many millions of comparisons among the cassettes and their functions. Using a traditional relational database management system, these cassettes and their functional characteristics are represented by a correlation table of more than 2 billion rows along with a dozen auxiliary tables. This correlation table requires 16.5 hours to build and a typical query requires 5 to 10 minutes to answer. We developed an alternative approach that encodes the cassettes and their functions using bitmaps. Reading the input data now takes about 1.5 hours and constructing the bitmap representations takes only 8 minutes. This amounts to less than one tenth of the time needed to build the correlation table. Furthermore, fairly complex queries can now be answered in seconds. In this work, we considered three basic forms of queries required to support gene context analysis and devised two different bitmap representations to answer such queries. These queries can be answered in less than a second. A more complex query, which we refer to as a \"killer query\", requires the examination of multi-way cross-products of all cassettes. We developed a progressive pruning strategy that effectively reduces the number of possible combinations examined. Tests have shown that we can now answer \"killer queries\" in seconds. Even with an extremely complex \"killer query\" involving 161 genomes (needing a 161-way cross-product), our algorithm took less than 10 seconds. A query involving this many genomes is expected to take so much time using a traditional DBMS that it has never been attempted before. Working with the IMG developers, we have verified our implementation and have integrated it into the production version of IMG.","PeriodicalId":269347,"journal":{"name":"Proceedings of the 25th International Conference on Scientific and Statistical Database Management","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115382102","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
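A minimal sketch of the bitmap idea, under the assumption that each cassette's functional profile is a set of small integer function IDs (the IDs and helper names are hypothetical; this is not the IMG implementation): each profile becomes one bitmap, and a conjunctive "cassettes containing all of these functions" query reduces to a single bitwise AND and comparison per cassette.

```python
def build_bitmaps(cassettes):
    """Encode each cassette's set of function IDs as one bitmap
    (a Python int with bit f set iff function f occurs in the cassette)."""
    bitmaps = []
    for funcs in cassettes:
        bm = 0
        for f in funcs:
            bm |= 1 << f
        bitmaps.append(bm)
    return bitmaps

def cassettes_with_functions(bitmaps, query_funcs):
    """Return indices of cassettes containing all functions in `query_funcs`.
    The set-containment test is one AND plus one compare per cassette."""
    q = 0
    for f in query_funcs:
        q |= 1 << f
    return [i for i, bm in enumerate(bitmaps) if bm & q == q]
```

The speedups reported in the abstract come from exactly this replacement of row-at-a-time joins over a multi-billion-row correlation table with word-parallel bitwise operations.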
{"title":"RMiCS: a robust approach for mining coherent subgraphs in edge-labeled multi-layer graphs","authors":"Brigitte Boden, Stephan Günnemann, H. Hoffmann, T. Seidl","doi":"10.1145/2484838.2484860","DOIUrl":"https://doi.org/10.1145/2484838.2484860","url":null,"abstract":"Detecting dense subgraphs in a large graph is an important graph mining problem, and various approaches have been proposed for its solution. While most existing methods only consider unlabeled and one-dimensional graph data, many real-world applications provide far richer information. Thus, in our work, we consider graphs that contain different types of edges -- represented as different layers/dimensions of a graph -- as well as edge labels that further characterize the relations between two vertices. We argue that exploiting this additional information supports the detection of more interesting clusters. In general, we aim at detecting clusters of vertices that are densely connected by edges with similar labels in subsets of the graph layers. So far, there exists only a single method that tries to detect clusters in such graphs. This method, however, is highly sensitive to noise: even a single edge with a deviating label can completely prevent the detection of interesting clusters. In this paper, we present the RCS (Robust Coherent Subgraph) model, which enables us to detect clusters even in noisy data. This robustness greatly enhances the applicability to real-world data. In order to obtain interpretable results, RCS avoids redundant clusters in the result set. We present the algorithm RMiCS for the efficient detection of RCS clusters and analyze its behavior in various experiments on synthetic and real-world data.","PeriodicalId":269347,"journal":{"name":"Proceedings of the 25th International Conference on Scientific and Statistical Database Management","volume":"54 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123645571","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Reasoning about RFID-tracked moving objects in symbolic indoor spaces","authors":"Sari Haj Hussein, Hua Lu, T. Pedersen","doi":"10.1145/2484838.2484877","DOIUrl":"https://doi.org/10.1145/2484838.2484877","url":null,"abstract":"In recent years, indoor spatial data management has started to attract attention, partly due to the increasing use of receptor devices (e.g., RFID readers and wireless sensor networks) in indoor as well as outdoor spaces. There is thus a great need for a model that captures such spaces and their receptors, and provides powerful reasoning techniques on top. This paper reviews and extends a recent unified model of outdoor and indoor spaces and receptor deployments in these spaces. The extended model enables modelers to capture various pieces of information from the physical world. On top of the extended model, this paper proposes and formalizes the route observability concept and demonstrates its usefulness in enhancing the reading environment. The extended model also enables incorporating receptor data through a probabilistic trajectory-to-route translator. This translator first facilitates the tracking of moving objects, enabling the search for them to be optimized, and second supports high-level reasoning about points of potential traffic (over)load, so-called bottleneck points. The functional analysis illustrates the behavior of the route observability function. The experimental evaluation shows the accuracy of the translator and the quality of the inference and reasoning. The experiments are conducted on both synthetic data and uncleansed, real-world data obtained from RFID-tagged flight baggage.","PeriodicalId":269347,"journal":{"name":"Proceedings of the 25th International Conference on Scientific and Statistical Database Management","volume":"2 11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115006228","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SMIX: self-managing indexes for dynamic workloads","authors":"H. Voigt, T. Kissinger, Wolfgang Lehner","doi":"10.1145/2484838.2484862","DOIUrl":"https://doi.org/10.1145/2484838.2484862","url":null,"abstract":"As databases accumulate growing amounts of data at an increasing rate, adaptive indexing becomes more and more important. At the same time, applications and their use get more agile and flexible, resulting in less steady and less predictable workload characteristics. Being inert and coarse-grained, state-of-the-art index tuning techniques become less useful in such environments. In particular, the full-column indexing paradigm results in many indexed but never queried records and prohibitively high storage and maintenance costs. In this paper, we present Self-Managing Indexes, a novel, adaptive, fine-grained, autonomous indexing infrastructure. At its core, our approach builds on a novel access path that automatically collects useful index information, discards useless index information, and competes with its kind for resources to host its index information. Compared to existing technologies for adaptive indexing, we are able to dynamically grow and shrink our indexes, instead of incrementally enhancing the index granularity.","PeriodicalId":269347,"journal":{"name":"Proceedings of the 25th International Conference on Scientific and Statistical Database Management","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129371514","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Making sense of big data with the Berkeley data analytics stack","authors":"M. Franklin","doi":"10.1145/2484838.2484884","DOIUrl":"https://doi.org/10.1145/2484838.2484884","url":null,"abstract":"The Berkeley AMPLab was founded on the idea that the challenges of emerging Big Data applications require a new approach to analytics systems. Launching in early 2011, the project set out to rethink the traditional analytics stack, breaking down technical and intellectual barriers that had arisen during decades of evolutionary development. The vision of the lab is to seamlessly integrate the three main resources available for making sense of data at scale: Algorithms (such as machine learning and statistical techniques), Machines (in the form of scalable clusters and elastic cloud computing), and People (both individually as analysts and en masse, as with crowd-sourced human computation). To pursue this goal, we assembled a research team with diverse interests across computer science, forged relationships with domain experts on campus and elsewhere, and obtained the support of leading industry partners and major government sponsors. The lab is realizing its ideas through the development of a freely-available Open Source software stack called BDAS: the Berkeley Data Analytics Stack. In the nearly three years the lab has been in operation, we've released major components of BDAS. Several of these components have gained significant traction in industry and elsewhere: the Mesos cluster resource manager, the Spark in-memory computation framework, and the Shark query processing system. In this talk I'll describe the current state of BDAS with an emphasis on the key components that have been released to date. I'll then discuss ongoing efforts on machine learning scalability and ease of use, including the MLbase system, as our focus moves higher up the stack. Finally I will present our longer-term views of how all the pieces will fit together to form a system that can adaptively bring the right resources to bear on a given data-driven question to meet time, cost and quality requirements throughout the analytics lifecycle.","PeriodicalId":269347,"journal":{"name":"Proceedings of the 25th International Conference on Scientific and Statistical Database Management","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116999004","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DoS: an efficient scheme for the diversification of multiple search results","authors":"Hina A. Khan, Marina Drosou, M. Sharaf","doi":"10.1145/2484838.2484858","DOIUrl":"https://doi.org/10.1145/2484838.2484858","url":null,"abstract":"Data diversification provides users with a concise and meaningful view of the results returned by search queries. In addition to taming the information overload, data diversification also provides the benefits of reducing data communication costs as well as enabling data exploration. The explosion of big data emphasizes the need for data diversification in modern data management platforms, especially for applications based on web, scientific, and business databases. Achieving effective diversification, however, is a rather challenging task due to the inherently high processing costs of current data diversification techniques. This challenge is further accentuated in a multi-user environment, in which multiple search queries are to be executed and diversified concurrently. In this paper, we propose the DoS scheme, which addresses the problem of scalable diversification of multiple search results. Our experimental evaluation shows the scalability exhibited by DoS under various workload settings, and the significant benefits it provides compared to sequential methods.","PeriodicalId":269347,"journal":{"name":"Proceedings of the 25th International Conference on Scientific and Statistical Database Management","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129432299","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HmSearch: an efficient hamming distance query processing algorithm","authors":"Xiaoyan Zhang, Jianbin Qin, Wei Wang, Yifang Sun, Jiaheng Lu","doi":"10.1145/2484838.2484842","DOIUrl":"https://doi.org/10.1145/2484838.2484842","url":null,"abstract":"Hamming distance measures the number of dimensions in which two vectors have different values. In applications such as pattern recognition, information retrieval, and databases, we often need to efficiently process Hamming distance queries, which retrieve vectors in a database that are within Hamming distance k of a given query vector. Existing work on efficient Hamming distance query processing suffers from some of the following limitations: it is applicable only to tiny error threshold values, unable to deal with vectors where the value domain is large, or unable to attain robust performance in the presence of data skew. In this paper, we propose HmSearch, an efficient query processing method for Hamming distance queries that addresses the above-mentioned limitations. Our method is based on improved enumeration-based signatures, enhanced filtering, and hierarchical binary filtering-and-verification. We also design an effective dimension rearrangement method to deal with data skew. Extensive experimental results demonstrate that our methods outperform state-of-the-art methods by up to two orders of magnitude.","PeriodicalId":269347,"journal":{"name":"Proceedings of the 25th International Conference on Scientific and Statistical Database Management","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123366764","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
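HmSearch's improved enumeration signatures are not reproduced here, but the basic filter-and-verify idea that such signature schemes refine can be sketched with a plain pigeonhole partition filter: split each vector into k+1 chunks, so any vector within Hamming distance k of the query must match it exactly on at least one chunk. The class and method names below are illustrative, not the paper's API.

```python
from collections import defaultdict

def hamming(u, v):
    """Hamming distance between two equal-length sequences."""
    return sum(a != b for a, b in zip(u, v))

class HammingIndex:
    """Pigeonhole filter for Hamming distance queries: index every vector
    under each of its k+1 chunks; candidates share at least one chunk
    with the query, then are verified exactly."""
    def __init__(self, vectors, k):
        self.vectors, self.k = vectors, k
        n = len(vectors[0])
        parts = k + 1
        step = -(-n // parts)  # ceiling division for near-equal chunks
        self.bounds = [(i * step, min((i + 1) * step, n)) for i in range(parts)]
        self.tables = [defaultdict(list) for _ in range(parts)]
        for idx, v in enumerate(vectors):
            for p, (lo, hi) in enumerate(self.bounds):
                self.tables[p][v[lo:hi]].append(idx)

    def query(self, q):
        # Filtering: collect vectors matching q exactly on some chunk
        cands = set()
        for p, (lo, hi) in enumerate(self.bounds):
            cands.update(self.tables[p].get(q[lo:hi], []))
        # Verification: compute the true Hamming distance
        return sorted(i for i in cands if hamming(self.vectors[i], q) <= self.k)
```

The limitations the abstract lists map onto this sketch: plain chunk matching degrades for larger k (chunks get short and unselective) and under skew (one chunk value shared by many vectors), which is what HmSearch's enhanced filtering and dimension rearrangement target.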