Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management: latest articles
{"title":"A study of partitioning and parallel UDF execution with the SAP HANA database","authors":"Philippe Grosse, Norman May, Wolfgang Lehner","doi":"10.1145/2618243.2618274","DOIUrl":"https://doi.org/10.1145/2618243.2618274","url":null,"abstract":"Large-scale data analysis relies on custom code both for preparing the data for analysis as well as for the core analysis algorithms. The map-reduce framework offers a simple model to parallelize custom code, but it does not integrate well with relational databases. Likewise, the literature on optimizing queries in relational databases has largely ignored user-defined functions (UDFs). In this paper, we discuss annotations for user-defined functions that facilitate optimizations that both consider relational operators and UDFs. In this paper we focus on optimizations that enable the parallel execution of relational operators and UDFs for a number of typical patterns. A study on real-world data investigates the opportunities for parallelization of complex data flows containing both relational operators and UDFs.","PeriodicalId":74773,"journal":{"name":"Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management","volume":"25 1","pages":"36:1-36:4"},"PeriodicalIF":0.0,"publicationDate":"2014-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84370109","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Mining statistically sound co-location patterns at multiple distances","authors":"Sajib Barua, J. Sander","doi":"10.1145/2618243.2618261","DOIUrl":"https://doi.org/10.1145/2618243.2618261","url":null,"abstract":"Existing co-location mining algorithms require a user provided distance threshold at which prevalent patterns are searched. Since spatial interactions, in reality, may happen at different distances, finding the right distance threshold to mine all true patterns is not easy and a single appropriate threshold may not even exist. A standard co-location mining algorithm also requires a prevalence measure threshold to find prevalent patterns. The prevalence measure values of the true co-location patterns occurring at different distances may vary and finding a prevalence measure threshold to mine all true patterns without reporting random patterns is not easy and sometimes not even possible. In this paper, we propose an algorithm to mine true co-location patterns at multiple distances. Our approach is based on a statistical test and does not require thresholds for the prevalence measure and the interaction distance. We evaluate the efficacy of our algorithm using synthetic and real data sets comparing it with the state-of-the-art co-location mining approach.","PeriodicalId":74773,"journal":{"name":"Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management","volume":"1 1","pages":"7:1-7:12"},"PeriodicalIF":0.0,"publicationDate":"2014-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85548398","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Detecting correlated columns in relational databases with mixed data types","authors":"H. Nguyen, Emmanuel Müller, Periklis Andritsos, Klemens Böhm","doi":"10.1145/2618243.2618251","DOIUrl":"https://doi.org/10.1145/2618243.2618251","url":null,"abstract":"In a database, besides known dependencies among columns (e.g., foreign key and primary key constraints), there are many other correlations unknown to the database users. Extraction of such hidden correlations is known to be useful for various tasks in database optimization and data analytics. However, the task is challenging due to the lack of measures to quantify column correlations. Correlations may exist among columns of different data types and value domains, which makes techniques based on value matching inapplicable. Besides, a column may have multiple semantics, which does not allow disjoint partitioning of columns. Finally, from a computational perspective, one has to consider a huge search space that grows exponentially with the number of columns.\nIn this paper, we present a novel method for detecting column correlations (DeCoRel). It aims at discovering overlapping groups of correlated columns with mixed data types in relational databases. To handle the heterogeneity of data types, we propose a new correlation measure that combines the good features of Shannon entropy and cumulative entropy. To address the huge search space, we introduce an efficient algorithm for the column grouping. Compared to state of the art techniques, we show our method to be more general than one of the most recent approaches in the database literature. Experiments reveal that our method achieves both higher quality and better scalability than existing techniques.","PeriodicalId":74773,"journal":{"name":"Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management","volume":"12 1","pages":"30:1-30:12"},"PeriodicalIF":0.0,"publicationDate":"2014-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74592234","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"New approaches to storing and manipulating multi-dimensional sparse arrays","authors":"E. Otoo, Hairong Wang, Gideon Nimako","doi":"10.1145/2618243.2618281","DOIUrl":"https://doi.org/10.1145/2618243.2618281","url":null,"abstract":"In this paper, we introduce some storage schemes for multi-dimensional sparse arrays (MDSAs) that handle the sparsity of the array with two primary goals; reducing the storage overhead and maintaining efficient data element access. Four schemes are proposed. These are: i.) The PATRICIA trie compressed storage method (PTCS) which uses PATRICIA trie to store the valid non-zero array elements; ii.) The extended compressed row storage (xCRS) which extends CRS method for sparse matrix storage to sparse arrays of higher dimensions and achieves the best data element access efficiency of all the methods; iii.) The bit encoded xCRS (BxCRS) which optimizes the storage utilization of xCRS by applying data compression methods with run length encoding, while maintaining its data access efficiency; and iv.) a hybrid approach that provides a desired balance between the storage utilization and data manipulation efficiency by combining xCRS and the Bit Encoded Sparse Storage (BESS). These storage schemes were evaluated and compared on three basic array operations; constructing the storage scheme, accessing a random element and retrieving a sub-array, using a set of synthetic sparse multi-dimensional arrays.","PeriodicalId":74773,"journal":{"name":"Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management","volume":"48 1","pages":"41:1-41:4"},"PeriodicalIF":0.0,"publicationDate":"2014-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87316217","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Simulation workflow design tailor-made for scientists","authors":"P. Reimann, H. Schwarz","doi":"10.1145/2618243.2618291","DOIUrl":"https://doi.org/10.1145/2618243.2618291","url":null,"abstract":"Scientific workflows have to deal with highly heterogeneous data environments. In particular, they have to carry out complex data provisioning tasks that filter and transform heterogeneous input data in such a way that underlying tools or services can ingest them. This results in a high complexity of workflow design. Scientists often want to design their workflows on their own, but usually do not have the necessary skills to cope with this complexity. Therefore, we have developed a pattern-based approach to workflow design, thereby mainly focusing on workflows that realize numeric simulations [4]. This approach removes the burden from scientists to specify low-level details of data provisioning. In this demonstration, we apply a prototype implementation of our approach to various use cases and show how it makes simulation workflow design tailor-made for scientists.","PeriodicalId":74773,"journal":{"name":"Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management","volume":"26 1","pages":"49:1-49:4"},"PeriodicalIF":0.0,"publicationDate":"2014-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82763244","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Point cloud databases","authors":"L. Dobos, I. Csabai, J. Szalai-Gindl, T. Budavári, A. Szalay","doi":"10.1145/2618243.2618275","DOIUrl":"https://doi.org/10.1145/2618243.2618275","url":null,"abstract":"We introduce the concept of the point cloud database, a new kind of database system aimed primarily towards scientific applications. Many scientific observations, experiments, feature extraction algorithms and large-scale simulations produce enormous amounts of data that are better represented as sparse (but often highly-clustered) points in a k-dimensional (k ≲ 10) metric space than on a multi-dimensional grid. Dimensionality reduction techniques, such as principal components, are also widely-used to project high dimensional data into similarly low dimensional spaces. Analysis techniques developed to work on multi-dimensional data points are usually implemented as in-memory algorithms and need to be modified to work in distributed cluster environments and on large amounts of disk-resident data. We conclude that the relational model, with certain additions, is appropriate for point clouds, but point cloud databases must also provide unique set of spatial search and proximity join operators, indexing schemes, and query language constructs that make them a distinct class of database systems.","PeriodicalId":74773,"journal":{"name":"Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management","volume":"83 1","pages":"33:1-33:4"},"PeriodicalIF":0.0,"publicationDate":"2014-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80650199","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DistillFlow: removing redundancy in scientific workflows","authors":"Jiuqiang Chen, Sarah Cohen Boulakia, C. Froidevaux, C. Goble, P. Missier, Alan R. Williams","doi":"10.1145/2618243.2618287","DOIUrl":"https://doi.org/10.1145/2618243.2618287","url":null,"abstract":"Scientific workflows management systems are increasingly used by scientists to specify complex data processing pipelines. Workflows are represented using a graph structure, where nodes represent tasks and links represent the dataflow. However, the complexity of workflow structures is increasing over time, reducing the rate of scientific workflows reuse. Here, we introduce DistillFlow, a tool based on effective methods for workflow design, with a focus on the Taverna model. DistillFlow is able to detect \"anti-patterns\" in the structure of workflows (idiomatic forms that lead to over-complicated design) and replace them with different patterns to reduce the workflow's overall structural complexity. Rewriting workflows in this way is beneficial both in terms of user experience and workflow maintenance.","PeriodicalId":74773,"journal":{"name":"Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management","volume":"26 1","pages":"46:1-46:4"},"PeriodicalIF":0.0,"publicationDate":"2014-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75565674","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Subspace anytime stream clustering","authors":"Marwan Hassani, P. Kranen, Rajveer Saini, T. Seidl","doi":"10.1145/2618243.2618286","DOIUrl":"https://doi.org/10.1145/2618243.2618286","url":null,"abstract":"Clustering of high dimensional streaming data is an emerging field of research. A real life data stream imposes many challenges on the clustering task, as an endless amount of data arrives constantly. A lot of research has been done in the full space stream clustering. To handle the varying speeds of the data stream, \"anytime\" algorithms are proposed but so far only in full space stream clustering. However, data streams from many application domains contain abundance of dimensions; the clusters often exist only in specific subspaces (subset of dimensions) and do not show up in the full feature space. In this paper, the first algorithm that considers both the high dimensionality and the varying speeds of streaming data, is proposed. The algorithm, called SubClusTree, can flexibly adapt to the different stream speeds and makes the best use of available time to provide a high quality subspace clustering. The experimental results prove the effectiveness of our anytime subspace concept.","PeriodicalId":74773,"journal":{"name":"Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management","volume":"50 1","pages":"37:1-37:4"},"PeriodicalIF":0.0,"publicationDate":"2014-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85476952","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Communication-efficient preference top-k monitoring queries via subscriptions","authors":"Kamalas Udomlamlert, T. Hara, S. Nishio","doi":"10.1145/2618243.2618284","DOIUrl":"https://doi.org/10.1145/2618243.2618284","url":null,"abstract":"With the increase of data generation in distributed fashions such as peer-to-peer systems and sensor networks, top-k query processing which returns only a small set of data that satisfies many users' preferences, becomes a substantial issue. When data are periodically updated in each epoch e.g., weather information, without any techniques, a naive solution is to aggregate all data and their updates to ensure the correctness of final answers, however, it is too costly in terms of data transfer especially for data aggregator nodes. In this paper, we propose a top-k monitoring query processing method in 2-tier distributed systems based on a publish-subscribe scheme. A set of top-k subscriptions specifying summary scope of users' interests is informed to aggregators to limit the number of transferred data records for each epoch. In addition, instead of issuing subscriptions of all queries, our method identifies a small set of minimal subscriptions resulting in lower communication overhead. Our experiments show that our technique is efficient and outperforms other comparative reactive methods.","PeriodicalId":74773,"journal":{"name":"Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management","volume":"29 1","pages":"44:1-44:4"},"PeriodicalIF":0.0,"publicationDate":"2014-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81215567","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Local context selection for outlier ranking in graphs with multiple numeric node attributes","authors":"Patricia Iglesias Sánchez, Emmanuel Müller, Oretta Irmler, Klemens Böhm","doi":"10.1145/2618243.2618266","DOIUrl":"https://doi.org/10.1145/2618243.2618266","url":null,"abstract":"Outlier ranking aims at the distinction between exceptional outliers and regular objects by measuring deviation of individual objects. In graphs with multiple numeric attributes, not all the attributes are relevant or show dependencies with the graph structure. Considering both graph structure and all given attributes, one cannot measure a clear deviation of objects. This is because the existence of irrelevant attributes clearly hinders the detection of outliers. Thus, one has to select local outlier contexts including only those attributes showing a high contrast between regular and deviating objects. It is an open challenge to detect meaningful local contexts for each node in attributed graphs.\nIn this work, we propose a novel local outlier ranking model for graphs with multiple numeric node attributes. For each object, our technique determines its subgraph and its statistically relevant subset of attributes locally. This context selection enables a high contrast between an outlier and the regular objects. Out of this context, we compute the outlierness score by incorporating both the attribute value deviation and the graph structure. In our evaluation on real and synthetic data, we show that our approach is able to detect contextual outliers that are missed by other outlier models.","PeriodicalId":74773,"journal":{"name":"Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management","volume":"4 1","pages":"16:1-16:12"},"PeriodicalIF":0.0,"publicationDate":"2014-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84587883","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}