{"title":"pcApriori: scalable apriori for multiprocessor systems","authors":"B. Schlegel, Tim Kiefer, T. Kissinger, Wolfgang Lehner","doi":"10.1145/2484838.2484879","DOIUrl":"https://doi.org/10.1145/2484838.2484879","url":null,"abstract":"Frequent-itemset mining is an important part of data mining. It is a computationally and memory-intensive task with a large number of scientific and statistical application areas. In many of them, the datasets can easily grow to tens or even several hundred gigabytes of data. Hence, efficient algorithms are required to process such amounts of data. In recent years, many efficient sequential mining algorithms have been proposed, which, however, cannot exploit current and future systems providing large degrees of parallelism. In contrast, the number of parallel frequent-itemset mining algorithms is rather small, and most of them do not scale well as the number of threads increases. In this paper, we present a highly scalable mining algorithm that is based on the well-known Apriori algorithm; it is optimized for processing very large datasets on multiprocessor systems. The key idea of pcApriori is to employ a modified producer--consumer processing scheme, which partitions the data during processing and distributes it to the available threads. We conduct many experiments on large datasets. pcApriori scales almost linearly on our test system comprising 32 cores.","PeriodicalId":269347,"journal":{"name":"Proceedings of the 25th International Conference on Scientific and Statistical Database Management","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125801544","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
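The level-wise generate-and-count core that pcApriori builds on can be sketched as follows. This is a minimal sequential Apriori baseline for illustration only, not the paper's parallel producer--consumer scheme; transaction data and the absolute support threshold are assumptions of the sketch.

```python
from itertools import combinations
from collections import Counter

def apriori(transactions, minsup):
    """Classic Apriori: level-wise mining of frequent itemsets.
    `transactions` is a list of sets; `minsup` is an absolute support count."""
    # Level 1: frequent single items
    counts = Counter(item for t in transactions for item in t)
    frequent = {frozenset([i]) for i, c in counts.items() if c >= minsup}
    result = set(frequent)
    k = 2
    while frequent:
        # Candidate generation: k-subsets whose (k-1)-subsets are all frequent
        items = sorted({i for s in frequent for i in s})
        candidates = {frozenset(c) for c in combinations(items, k)
                      if all(frozenset(sub) in frequent
                             for sub in combinations(c, k - 1))}
        # Support counting over all transactions
        counts = Counter()
        for t in transactions:
            for cand in candidates:
                if cand <= t:
                    counts[cand] += 1
        frequent = {c for c in candidates if counts[c] >= minsup}
        result |= frequent
        k += 1
    return result
```

pcApriori's contribution lies in how the counting phase is partitioned and distributed across threads; the anti-monotonicity pruning above is the part both variants share.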
{"title":"Multi-scale dissemination of time series data","authors":"Qingsong Guo, Yongluan Zhou, Li Su","doi":"10.1145/2484838.2484878","DOIUrl":"https://doi.org/10.1145/2484838.2484878","url":null,"abstract":"In this paper, we consider the problem of continuous dissemination of time series data, such as sensor measurements, to a large number of subscribers. These subscribers fall into multiple subscription levels, where each subscription level is specified by the bandwidth constraint of a subscriber, which is an abstract indicator of both the physical limits and the amount of data that the subscriber would like to handle. To handle this problem, we propose a system framework for multi-scale time series data dissemination that employs a typical tree-based dissemination network and existing time-series compression models. Due to the bandwidth limits and the potentially high speed of the data, it is inevitable that data is compressed and re-compressed along the dissemination paths according to the subscription level of each node. Compression causes a loss of accuracy, so we devise several algorithms to optimize the average accuracy of the data received by all subscribers within the dissemination network. Finally, we have conducted extensive experiments to study the performance of the algorithms.","PeriodicalId":269347,"journal":{"name":"Proceedings of the 25th International Conference on Scientific and Statistical Database Management","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128846006","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
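The abstract does not name the compression models used, but the compress/re-compress step along a dissemination path can be illustrated with Piecewise Aggregate Approximation (PAA), a standard time-series compression model; the `budget` parameter here is a hypothetical stand-in for a node's subscription-level bandwidth constraint.

```python
def paa_compress(series, budget):
    """Piecewise Aggregate Approximation: reduce `series` to at most
    `budget` segment means. Calling this again on an already compressed
    series with a smaller budget models re-compression further down a
    dissemination path."""
    n = len(series)
    budget = min(budget, n)
    out = []
    for s in range(budget):
        # Near-equal-width segments covering the whole series
        lo = s * n // budget
        hi = (s + 1) * n // budget
        out.append(sum(series[lo:hi]) / (hi - lo))
    return out
```

Each re-compression is lossy, which is exactly why the paper optimizes the average accuracy received across all subscribers rather than treating each hop independently.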
{"title":"Graywulf: a platform for federated scientific databases and services","authors":"L. Dobos, I. Csabai, A. Szalay, T. Budavári, Nolan Li","doi":"10.1145/2484838.2484863","DOIUrl":"https://doi.org/10.1145/2484838.2484863","url":null,"abstract":"Many fields of science rely on relational database management systems to analyze, publish and share data. Since RDBMS are originally designed for, and their development directions are primarily driven by, business use cases, they often lack features that are very important for scientific applications. Horizontal scalability is probably the most important missing feature, which makes it challenging to adapt traditional relational database systems to ever-growing data sizes. Due to the limited support of array data types and metadata management, successful application of RDBMS in science usually requires the development of custom extensions. While some of these extensions are specific to a particular field of science, the majority of them could easily be generalized and reused in other disciplines. With the Graywulf project we target several goals. We are building a generic platform that offers reusable components for efficient storage, transformation, statistical analysis and presentation of scientific data stored in Microsoft SQL Server. Graywulf also addresses the distributed computational issues arising from current RDBMS technologies. The current version supports load balancing of simple queries and parallel execution of partitioned queries over a set of mirrored databases. Uniform user access to the data is provided through a web-based query interface and a data surface for software clients. Queries are formulated in a slightly modified syntax of SQL that offers a transparent view of the distributed data. The software library consists of several components that can be reused to develop complex scientific data warehouses: a system registry, administration tools to manage entire database server clusters, a sophisticated workflow execution framework, and a SQL parser library.","PeriodicalId":269347,"journal":{"name":"Proceedings of the 25th International Conference on Scientific and Statistical Database Management","volume":"230 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122953006","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Accelerating gene context analysis using bitmaps","authors":"A. Romosan, A. Shoshani, Kesheng Wu, V. Markowitz, K. Mavrommatis","doi":"10.1145/2484838.2484856","DOIUrl":"https://doi.org/10.1145/2484838.2484856","url":null,"abstract":"Gene context analysis determines the function of genes by examining the conservation of chromosomal gene clusters and co-occurrence functional profiles across genomes. This is based on the observation that functionally related genes are often collocated on chromosomes as part of so-called \"gene cassettes\", and relies on the identification of such cassettes across a statistically significant and phylogenetically diverse collection of genomes. Gene context analysis is an important part of a genomic data management system such as the Integrated Microbial Genomes (IMG) system, which has one of the largest public genome collections. As of January 2013, IMG contains 3.3 million gene cassettes across 8,000 genomes. A gene context analysis in IMG performs many millions of comparisons among the cassettes and their functions. Using a traditional relational database management system, these cassettes and their functional characteristics are represented by a correlation table of more than 2 billion rows along with a dozen auxiliary tables. This correlation table requires 16.5 hours to build and a typical query requires 5 to 10 minutes to answer. We developed an alternative approach that encodes the cassettes and their functions using bitmaps. Reading the input data now takes about 1.5 hours and constructing the bitmap representations takes only 8 minutes. This amounts to less than one tenth of the time needed to build the correlation table. Furthermore, fairly complex queries can now be answered in seconds. In this work, we considered three basic forms of queries required to support gene context analysis and devised two different bitmap representations to answer such queries. These queries can be answered in less than a second. A more complex query, which we refer to as a \"killer query\", requires the examination of multi-way cross-products of all cassettes. We developed a progressive pruning strategy that effectively reduces the number of possible combinations examined. Tests have shown that we can now answer \"killer queries\" in seconds. Even with an extremely complex \"killer query\" involving 161 genomes (needing a 161-way cross-product), our algorithm took less than 10 seconds. A query involving this many genomes is expected to take so much time using a traditional DBMS that it has never been attempted before. Working with the IMG developers, we have verified our implementation and have integrated it into the production version of IMG.","PeriodicalId":269347,"journal":{"name":"Proceedings of the 25th International Conference on Scientific and Statistical Database Management","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115382102","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
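A minimal sketch of the bitmap idea, under the assumption that each cassette's functional profile is a set of small integer function IDs (the IDs and helper names are hypothetical; this is not the IMG implementation): each profile becomes one bitmap, and a conjunctive "cassettes containing all of these functions" query reduces to a single bitwise AND and comparison per cassette.

```python
def build_bitmaps(cassettes):
    """Encode each cassette's set of function IDs as one bitmap
    (a Python int with bit f set iff function f occurs in the cassette)."""
    bitmaps = []
    for funcs in cassettes:
        bm = 0
        for f in funcs:
            bm |= 1 << f
        bitmaps.append(bm)
    return bitmaps

def cassettes_with_functions(bitmaps, query_funcs):
    """Return indices of cassettes containing all functions in `query_funcs`.
    The set-containment test is one AND plus one compare per cassette."""
    q = 0
    for f in query_funcs:
        q |= 1 << f
    return [i for i, bm in enumerate(bitmaps) if bm & q == q]
```

The speedups reported in the abstract come from exactly this replacement of row-at-a-time joins over a multi-billion-row correlation table with word-parallel bitwise operations.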
{"title":"RMiCS: a robust approach for mining coherent subgraphs in edge-labeled multi-layer graphs","authors":"Brigitte Boden, Stephan Günnemann, H. Hoffmann, T. Seidl","doi":"10.1145/2484838.2484860","DOIUrl":"https://doi.org/10.1145/2484838.2484860","url":null,"abstract":"Detecting dense subgraphs in a large graph is an important graph mining problem, and various approaches have been proposed for its solution. While most existing methods only consider unlabeled and one-dimensional graph data, many real-world applications provide far richer information. Thus, in our work, we consider graphs that contain different types of edges -- represented as different layers/dimensions of a graph -- as well as edge labels that further characterize the relations between two vertices. We argue that exploiting this additional information supports the detection of more interesting clusters. In general, we aim at detecting clusters of vertices that are densely connected by edges with similar labels in subsets of the graph layers. So far, there exists only a single method that tries to detect clusters in such graphs. This method, however, is highly sensitive to noise: even a single edge with a deviating label can completely prevent the detection of interesting clusters. In this paper, we present the RCS (Robust Coherent Subgraph) model, which enables us to detect clusters even in noisy data. This robustness greatly enhances the applicability to real-world data. In order to obtain interpretable results, RCS avoids redundant clusters in the result set. We present the algorithm RMiCS for the efficient detection of RCS clusters and analyze its behavior in various experiments on synthetic and real-world data.","PeriodicalId":269347,"journal":{"name":"Proceedings of the 25th International Conference on Scientific and Statistical Database Management","volume":"54 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123645571","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Reasoning about RFID-tracked moving objects in symbolic indoor spaces","authors":"Sari Haj Hussein, Hua Lu, T. Pedersen","doi":"10.1145/2484838.2484877","DOIUrl":"https://doi.org/10.1145/2484838.2484877","url":null,"abstract":"In recent years, indoor spatial data management has started to attract attention, partly due to the increasing use of receptor devices (e.g., RFID readers and wireless sensor networks) in indoor as well as outdoor spaces. There is thus a great need for a model that captures such spaces and their receptors, and provides powerful reasoning techniques on top. This paper reviews and extends a recent unified model of outdoor and indoor spaces and receptor deployments in these spaces. The extended model enables modelers to capture various pieces of information from the physical world. On top of the extended model, this paper proposes and formalizes the route observability concept and demonstrates its usefulness in enhancing the reading environment. The extended model also enables incorporating receptor data through a probabilistic trajectory-to-route translator. This translator first facilitates the tracking of moving objects, enabling the search for them to be optimized, and second supports high-level reasoning about points of potential traffic (over)load, so-called bottleneck points. The functional analysis illustrates the behavior of the route observability function. The experimental evaluation shows the accuracy of the translator and the quality of the inference and reasoning. The experiments are conducted on both synthetic data and uncleansed, real-world data obtained from RFID-tagged flight baggage.","PeriodicalId":269347,"journal":{"name":"Proceedings of the 25th International Conference on Scientific and Statistical Database Management","volume":"2 11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115006228","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SMIX: self-managing indexes for dynamic workloads","authors":"H. Voigt, T. Kissinger, Wolfgang Lehner","doi":"10.1145/2484838.2484862","DOIUrl":"https://doi.org/10.1145/2484838.2484862","url":null,"abstract":"As databases accumulate growing amounts of data at an increasing rate, adaptive indexing becomes more and more important. At the same time, applications and their use get more agile and flexible, resulting in less steady and less predictable workload characteristics. Being inert and coarse-grained, state-of-the-art index tuning techniques become less useful in such environments. In particular, the full-column indexing paradigm results in many indexed but never queried records and prohibitively high storage and maintenance costs. In this paper, we present Self-Managing Indexes, a novel, adaptive, fine-grained, autonomous indexing infrastructure. At its core, our approach builds on a novel access path that automatically collects useful index information, discards useless index information, and competes with its kind for resources to host its index information. Compared to existing technologies for adaptive indexing, we are able to dynamically grow and shrink our indexes, instead of incrementally enhancing the index granularity.","PeriodicalId":269347,"journal":{"name":"Proceedings of the 25th International Conference on Scientific and Statistical Database Management","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129371514","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Making sense of big data with the Berkeley data analytics stack","authors":"M. Franklin","doi":"10.1145/2484838.2484884","DOIUrl":"https://doi.org/10.1145/2484838.2484884","url":null,"abstract":"The Berkeley AMPLab was founded on the idea that the challenges of emerging Big Data applications require a new approach to analytics systems. Launching in early 2011, the project set out to rethink the traditional analytics stack, breaking down technical and intellectual barriers that had arisen during decades of evolutionary development. The vision of the lab is to seamlessly integrate the three main resources available for making sense of data at scale: Algorithms (such as machine learning and statistical techniques), Machines (in the form of scalable clusters and elastic cloud computing), and People (both individually as analysts and en masse, as with crowd-sourced human computation). To pursue this goal, we assembled a research team with diverse interests across computer science, forged relationships with domain experts on campus and elsewhere, and obtained the support of leading industry partners and major government sponsors. The lab is realizing its ideas through the development of a freely-available Open Source software stack called BDAS: the Berkeley Data Analytics Stack. In the nearly three years the lab has been in operation, we've released major components of BDAS. Several of these components have gained significant traction in industry and elsewhere: the Mesos cluster resource manager, the Spark in-memory computation framework, and the Shark query processing system. In this talk I'll describe the current state of BDAS with an emphasis on the key components that have been released to date. I'll then discuss ongoing efforts on machine learning scalability and ease of use, including the MLbase system, as our focus moves higher up the stack. Finally I will present our longer-term views of how all the pieces will fit together to form a system that can adaptively bring the right resources to bear on a given data-driven question to meet time, cost and quality requirements throughout the analytics lifecycle.","PeriodicalId":269347,"journal":{"name":"Proceedings of the 25th International Conference on Scientific and Statistical Database Management","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116999004","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DoS: an efficient scheme for the diversification of multiple search results","authors":"Hina A. Khan, Marina Drosou, M. Sharaf","doi":"10.1145/2484838.2484858","DOIUrl":"https://doi.org/10.1145/2484838.2484858","url":null,"abstract":"Data diversification provides users with a concise and meaningful view of the results returned by search queries. In addition to taming the information overload, data diversification also provides the benefits of reducing data communication costs as well as enabling data exploration. The explosion of big data emphasizes the need for data diversification in modern data management platforms, especially for applications based on web, scientific, and business databases. Achieving effective diversification, however, is a rather challenging task due to the inherently high processing costs of current data diversification techniques. This challenge is further accentuated in a multi-user environment, in which multiple search queries are to be executed and diversified concurrently. In this paper, we propose the DoS scheme, which addresses the problem of scalable diversification of multiple search results. Our experimental evaluation shows the scalability exhibited by DoS under various workload settings, and the significant benefits it provides compared to sequential methods.","PeriodicalId":269347,"journal":{"name":"Proceedings of the 25th International Conference on Scientific and Statistical Database Management","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129432299","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HmSearch: an efficient hamming distance query processing algorithm","authors":"Xiaoyan Zhang, Jianbin Qin, Wei Wang, Yifang Sun, Jiaheng Lu","doi":"10.1145/2484838.2484842","DOIUrl":"https://doi.org/10.1145/2484838.2484842","url":null,"abstract":"Hamming distance measures the number of dimensions in which two vectors have different values. In applications such as pattern recognition, information retrieval, and databases, we often need to efficiently process Hamming distance queries, which retrieve vectors in a database that are within Hamming distance k of a given query vector. Existing work on efficient Hamming distance query processing suffers from some of the following limitations: it is applicable only to tiny error threshold values, unable to deal with vectors where the value domain is large, or unable to attain robust performance in the presence of data skew. In this paper, we propose HmSearch, an efficient query processing method for Hamming distance queries that addresses the above-mentioned limitations. Our method is based on improved enumeration-based signatures, enhanced filtering, and hierarchical binary filtering-and-verification. We also design an effective dimension rearrangement method to deal with data skew. Extensive experimental results demonstrate that our methods outperform state-of-the-art methods by up to two orders of magnitude.","PeriodicalId":269347,"journal":{"name":"Proceedings of the 25th International Conference on Scientific and Statistical Database Management","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123366764","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
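HmSearch's improved enumeration signatures are not reproduced here, but the basic filter-and-verify idea that such signature schemes refine can be sketched with a plain pigeonhole partition filter: split each vector into k+1 chunks, so any vector within Hamming distance k of the query must match it exactly on at least one chunk. The class and method names below are illustrative, not the paper's API.

```python
from collections import defaultdict

def hamming(u, v):
    """Hamming distance between two equal-length sequences."""
    return sum(a != b for a, b in zip(u, v))

class HammingIndex:
    """Pigeonhole filter for Hamming distance queries: index every vector
    under each of its k+1 chunks; candidates share at least one chunk
    with the query, then are verified exactly."""
    def __init__(self, vectors, k):
        self.vectors, self.k = vectors, k
        n = len(vectors[0])
        parts = k + 1
        step = -(-n // parts)  # ceiling division for near-equal chunks
        self.bounds = [(i * step, min((i + 1) * step, n)) for i in range(parts)]
        self.tables = [defaultdict(list) for _ in range(parts)]
        for idx, v in enumerate(vectors):
            for p, (lo, hi) in enumerate(self.bounds):
                self.tables[p][v[lo:hi]].append(idx)

    def query(self, q):
        # Filtering: collect vectors matching q exactly on some chunk
        cands = set()
        for p, (lo, hi) in enumerate(self.bounds):
            cands.update(self.tables[p].get(q[lo:hi], []))
        # Verification: compute the true Hamming distance
        return sorted(i for i in cands if hamming(self.vectors[i], q) <= self.k)
```

The limitations the abstract lists map onto this sketch: plain chunk matching degrades for larger k (chunks get short and unselective) and under skew (one chunk value shared by many vectors), which is what HmSearch's enhanced filtering and dimension rearrangement target.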