Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management: Latest Publications

Exploring subspace clustering for recommendations
Katharina Rausch, Eirini Ntoutsi, K. Stefanidis, H. Kriegel
DOI: 10.1145/2618243.2618283 · Pages: 42:1-42:4 · Published: 2014-06-30

Abstract: Typically, recommendations are computed by considering users similar to the user in question. However, scanning the whole database of users to locate similar users is expensive. Existing approaches build user profiles by employing full-dimensional clustering to find sets of similar users. Since the datasets we deal with are high-dimensional and incomplete, full-dimensional clustering is not the best option. To this end, we explore a fault-tolerant subspace clustering approach that detects clusters of similar users in subspaces of the original feature space and also allows for missing values. Our experiments on real movie datasets show that diversifying the similar users through subspace clustering results in better recommendations compared to traditional collaborative filtering and full-dimensional clustering approaches.

Maintaining a microbial genome & metagenome data analysis system in an academic setting
I. Chen, V. Markowitz, E. Szeto, Krishna Palaniappan, Ken Chu
DOI: 10.1145/2618243.2618244 · Pages: 3:1-3:11 · Published: 2014-06-30

Abstract: The Integrated Microbial Genomes (IMG) system integrates microbial community aggregate genomes (metagenomes) with genomes from all domains of life. IMG provides tools for analyzing and reviewing the structural and functional annotations of metagenomes and genomes in a comparative context. At the core of the IMG system is a data warehouse that contains genome and metagenome datasets provided by scientific users, as well as public bacterial, archaeal, eukaryotic, and viral genomes from the US National Center for Biotechnology Information genomic archive and a rich set of engineered, environmental, and host-associated metagenomes. Genome and metagenome datasets are processed using IMG's microbial genome and metagenome sequence data processing pipelines and then integrated into the data warehouse using IMG's data integration toolkit. Microbial genome and metagenome application-specific user interfaces provide access to different subsets of IMG's data and analysis toolkits. Genome and metagenome analysis is a gene-centric iterative process that involves a sequence (composition) of data exploration and comparative analysis operations, with individual operations expected to have rapid response times.

From its first release in 2005, IMG has grown from an initial content of about 300 genomes with a total of 2 million genes to 22,578 bacterial, archaeal, eukaryotic, and viral genomes and 4,188 metagenome samples, with about 24.6 billion genes as of May 1, 2014. IMG's database architecture is continuously revised in order to cope with the rapid increase in the number and size of the genome and metagenome datasets, maintain good query performance, and accommodate new data types. We present in this paper IMG's new database architecture, developed over the past three years in the context of the limited financial, engineering, and data management resources customary for academic database systems. We discuss the alternative commercial and open-source database management systems we considered and experimented with, and describe the hybrid architecture we devised to sustain IMG's rapid growth.

{"title":"Data movement in hybrid analytic systems: a case for automation","authors":"Patrick Leyshock, D. Maier, K. Tufte","doi":"10.1145/2618243.2618273","DOIUrl":"https://doi.org/10.1145/2618243.2618273","url":null,"abstract":"Hybrid data analysis systems integrate an analytic tool and a data management tool. While hybrid systems have benefits, in order to be effective data movement between the two hybrid components must be minimized. Through experimental results we demonstrate that under workloads whose inputs vary in size, shape, and location, automation is the only practical way to manage data movement in hybrid systems.","PeriodicalId":74773,"journal":{"name":"Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management","volume":"192 1","pages":"39:1-39:4"},"PeriodicalIF":0.0,"publicationDate":"2014-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76567383","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A subspace filter supporting the discovery of small clusters in very noisy datasets","authors":"F. Höppner","doi":"10.1145/2618243.2618260","DOIUrl":"https://doi.org/10.1145/2618243.2618260","url":null,"abstract":"Feature selection becomes crucial when exploring high-dimensional datasets via clustering, because it is unlikely that the data groups jointly in all dimensions but clustering algorithms treat all attributes equally. A new subspace filter approach is presented that is capable of coping with the difficult situation of finding small clusters embedded in a very noisy environment (more noise than clustering data), which is not mislead by dense, high-dimensional spots caused by density fluctuations of single attributes. Experimental evaluation on artificial and real datasets demonstrate good performance and high efficiency.","PeriodicalId":74773,"journal":{"name":"Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management","volume":"299 1","pages":"14:1-14:12"},"PeriodicalIF":0.0,"publicationDate":"2014-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75434661","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Matching dominance: capture the semantics of dominance for multi-dimensional uncertain objects
Ying Zhang, W. Zhang, Xuemin Lin, M. A. Cheema, Chengqi Zhang
DOI: 10.1145/2618243.2618246 · Pages: 18:1-18:12 · Published: 2014-06-30

Abstract: The dominance operator plays an important role in a wide spectrum of multi-criteria decision making applications. Generally speaking, a dominance operator is a partial order on a set O of objects, and we say the dominance operator has the monotonic property with regard to a family of ranking functions F if o1 dominates o2 implies f(o1) ≥ f(o2) for any ranking function f ∈ F and objects o1, o2 ∈ O. The dominance operator on multi-dimensional points is well defined and has the monotonic property with regard to any monotonic ranking (scoring) function. Due to the uncertain nature of data in many emerging applications, a variety of existing works have studied the semantics of ranking queries on uncertain objects. However, the problem of a dominance operator for multi-dimensional uncertain objects remains open. Although there have been several attempts to propose dominance operators on multi-dimensional uncertain objects, none of them guarantees the monotonic property with regard to these ranking approaches.

Motivated by this, in this paper we propose a novel matching-based dominance operator, namely matching dominance, to capture the semantics of dominance for multi-dimensional uncertain objects, so that the new dominance operator has the monotonic property with regard to the monotonic parameterized ranking function, which can unify other popular ranking approaches for uncertain objects. We then develop a layer indexing technique, the Matching Dominance based Band (MDB), to facilitate top-k queries on multi-dimensional uncertain objects based on the matching dominance operator proposed in this paper. Efficient algorithms are proposed to compute the MDB index. Comprehensive experiments convincingly demonstrate the effectiveness and efficiency of our indexing techniques.

{"title":"Efficient data management and statistics with zero-copy integration","authors":"Jonathan Lajus, H. Mühleisen","doi":"10.1145/2618243.2618265","DOIUrl":"https://doi.org/10.1145/2618243.2618265","url":null,"abstract":"Statistical analysts have long been struggling with evergrowing data volumes. While specialized data management systems such as relational databases would be able to handle the data, statistical analysis tools are far more convenient to express complex data analyses. An integration of these two classes of systems has the potential to overcome the data management issue while at the same time keeping analysis convenient. However, one must keep a careful eye on implementation overheads such as serialization. In this paper, we propose the in-process integration of data management and analytical tools. Furthermore, we argue that a zero-copy integration is feasible due to the omnipresence of C-style arrays containing native types. We discuss the general concept and present a prototype of this integration based on the columnar relational database MonetDB and the R environment for statistical computing. We evaluate the performance of this prototype in a series of micro-benchmarks of common data management tasks.","PeriodicalId":74773,"journal":{"name":"Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management","volume":"27 1","pages":"12:1-12:10"},"PeriodicalIF":0.0,"publicationDate":"2014-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73999681","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Geometric graph matching and similarity: a probabilistic approach","authors":"Ayser Armiti, Michael Gertz","doi":"10.1145/2618243.2618259","DOIUrl":"https://doi.org/10.1145/2618243.2618259","url":null,"abstract":"Finding common structures is vital for many graph-based applications, such as road network analysis, pattern recognition, or drug discovery. Such a task is formalized as the inexact graph matching problem, which is known to be NP-hard. Several graph matching algorithms have been proposed to find approximate solutions. However, such algorithms still face many problems in terms of memory consumption, runtime, and tolerance to changes in graph structure or labels.\u0000 In this paper, we propose a solution to the inexact graph matching problem for geometric graphs in 2D space. Geometric graphs provide a suitable modeling framework for applications like the above, where vertices are located in some 2D space. The main idea of our approach is to formalize the graph matching problem in a maximum likelihood estimation framework. Then, the expectation maximization technique is used to estimate the match between two graphs. We propose a novel density function that estimates the similarity between the vertices of different graphs. It is computed based on both 1) the spatial properties of a vertex and its direct neighbors, and 2) the shortest paths that connect a vertex to other vertices in a graph. To guarantee scalability, we propose to compute the density function based on the properties of sub-structures of the graph. Using representative geometric graphs from several application domains, we show that our approach outperforms existing graph matching algorithms in terms of matching quality, runtime, and memory consumption.","PeriodicalId":74773,"journal":{"name":"Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management","volume":"55 1","pages":"27:1-27:12"},"PeriodicalIF":0.0,"publicationDate":"2014-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83777622","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Node classification in uncertain graphs","authors":"Michele Dallachiesa, C. Aggarwal, Themis Palpanas","doi":"10.1145/2618243.2618277","DOIUrl":"https://doi.org/10.1145/2618243.2618277","url":null,"abstract":"In many real applications that use and analyze networked data, the links in the network graph may be erroneous, or derived from probabilistic techniques. In such cases, the node classification problem can be challenging, since the unreliability of the links may affect the final results of the classification process. In this paper, we focus on situations that require the analysis of the uncertainty that is present in the graph structure. We study the novel problem of node classification in uncertain graphs, by treating uncertainty as a first-class citizen. We propose two techniques based on a Bayes model, and show the benefits of incorporating uncertainty in the classification process as a first-class citizen. The experimental results demonstrate the effectiveness of our approaches.","PeriodicalId":74773,"journal":{"name":"Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management","volume":"82 1","pages":"32:1-32:4"},"PeriodicalIF":0.0,"publicationDate":"2014-05-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85597296","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Tuning large scale deduplication with reduced effort
Guilherme Dal Bianco, R. Galante, C. Heuser, Marcos André Gonçalves
DOI: 10.1145/2484838.2484873 · Pages: 18:1-18:12 · Published: 2013-07-29

Abstract: Deduplication is the task of identifying which objects in a data repository are potentially the same. It usually demands user intervention in several steps of the process, mainly to identify pairs representing matches and non-matches. This information is then used to help identify other potentially duplicated records. When deduplication is applied to very large datasets, performance and matching quality depend on expert users configuring the most important steps of the process (e.g., blocking and classification). In this paper, we propose a new framework, called FS-Dedup, that helps tune the deduplication process on large datasets with reduced effort from the user, who is only required to label a small, automatically selected subset of pairs. FS-Dedup exploits Signature-Based Deduplication (Sig-Dedup) algorithms in its deduplication core. Sig-Dedup is characterized by high efficiency and scalability on large datasets but requires an expert user to tune several parameters. FS-Dedup addresses this drawback by providing a framework that demands neither specialized user knowledge about the dataset nor hand-picked thresholds to produce high effectiveness. Our evaluation on large real and synthetic datasets (containing millions of records) shows that FS-Dedup is able to reach or even surpass the maximal matching quality obtained by Sig-Dedup techniques with reduced manual effort from the user.

Learning to explore scientific workflow repositories
Julia Stoyanovich, Paramveer S. Dhillon, S. Davidson, Brian Lyons
DOI: 10.1145/2484838.2484848 · Pages: 31:1-31:4 · Published: 2013-07-29

Abstract: Scientific workflows are gaining popularity, and repositories of workflows are starting to emerge. In this paper we describe TopicsExplorer, a data exploration approach for myExperiment.org, a collaborative platform for the exchange of scientific workflows and experimental plans. Our approach uses a variant of topic modeling with tags as features and generates a browsable view of the repository. TopicsExplorer has been fully integrated into the open-source platform of myExperiment.org and is available to users at www.myexperiment.org/topics. We also present our recently developed personalization component, which customizes topics based on user feedback. Finally, we discuss our ongoing performance optimization efforts that make computing and managing personalized topic views of the myExperiment.org repository feasible.
