Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management: Latest Publications

Exploring subspace clustering for recommendations
Katharina Rausch, Eirini Ntoutsi, K. Stefanidis, H. Kriegel
DOI: 10.1145/2618243.2618283 · Pages: 42:1-42:4 · Published: 2014-06-30

Abstract: Typically, recommendations are computed by considering users similar to the user in question. However, scanning the whole database of users to locate similar users is expensive. Existing approaches build user profiles by employing full-dimensional clustering to find sets of similar users. Since the datasets we deal with are high-dimensional and incomplete, full-dimensional clustering is not the best option. To this end, we explore a fault-tolerant subspace clustering approach that detects clusters of similar users in subspaces of the original feature space and also allows for missing values. Our experiments on real movie datasets show that diversifying the similar users through subspace clustering results in better recommendations compared to traditional collaborative filtering and full-dimensional clustering approaches.

Maintaining a microbial genome & metagenome data analysis system in an academic setting
I. Chen, V. Markowitz, E. Szeto, Krishna Palaniappan, Ken Chu
DOI: 10.1145/2618243.2618244 · Pages: 3:1-3:11 · Published: 2014-06-30

Abstract: The Integrated Microbial Genomes (IMG) system integrates microbial community aggregate genomes (metagenomes) with genomes from all domains of life. IMG provides tools for analyzing and reviewing the structural and functional annotations of metagenomes and genomes in a comparative context. At the core of the IMG system is a data warehouse that contains genome and metagenome datasets provided by scientific users, as well as public bacterial, archaeal, eukaryotic, and viral genomes from the US National Center for Biotechnology Information genomic archive and a rich set of engineered, environmental, and host-associated metagenomes. Genome and metagenome datasets are processed using IMG's microbial genome and metagenome sequence data processing pipelines and then integrated into the data warehouse using IMG's data integration toolkit. Microbial genome and metagenome application-specific user interfaces provide access to different subsets of IMG's data and analysis toolkits. Genome and metagenome analysis is a gene-centric iterative process that involves a sequence (composition) of data exploration and comparative analysis operations, with individual operations expected to have rapid response times.

From its first release in 2005, IMG has grown from an initial content of about 300 genomes with a total of 2 million genes to 22,578 bacterial, archaeal, eukaryotic, and viral genomes and 4,188 metagenome samples, with about 24.6 billion genes as of May 1, 2014. IMG's database architecture is continuously revised in order to cope with the rapid increase in the number and size of the genome and metagenome datasets, maintain good query performance, and accommodate new data types. We present in this paper IMG's new database architecture, developed over the past three years in the context of the limited financial, engineering, and data management resources customary for academic database systems. We discuss the alternative commercial and open-source database management systems we considered and experimented with, and describe the hybrid architecture we devised to sustain IMG's rapid growth.

{"title":"Data movement in hybrid analytic systems: a case for automation","authors":"Patrick Leyshock, D. Maier, K. Tufte","doi":"10.1145/2618243.2618273","DOIUrl":"https://doi.org/10.1145/2618243.2618273","url":null,"abstract":"Hybrid data analysis systems integrate an analytic tool and a data management tool. While hybrid systems have benefits, in order to be effective data movement between the two hybrid components must be minimized. Through experimental results we demonstrate that under workloads whose inputs vary in size, shape, and location, automation is the only practical way to manage data movement in hybrid systems.","PeriodicalId":74773,"journal":{"name":"Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management","volume":"192 1","pages":"39:1-39:4"},"PeriodicalIF":0.0,"publicationDate":"2014-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76567383","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A subspace filter supporting the discovery of small clusters in very noisy datasets","authors":"F. Höppner","doi":"10.1145/2618243.2618260","DOIUrl":"https://doi.org/10.1145/2618243.2618260","url":null,"abstract":"Feature selection becomes crucial when exploring high-dimensional datasets via clustering, because it is unlikely that the data groups jointly in all dimensions but clustering algorithms treat all attributes equally. A new subspace filter approach is presented that is capable of coping with the difficult situation of finding small clusters embedded in a very noisy environment (more noise than clustering data), which is not mislead by dense, high-dimensional spots caused by density fluctuations of single attributes. Experimental evaluation on artificial and real datasets demonstrate good performance and high efficiency.","PeriodicalId":74773,"journal":{"name":"Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management","volume":"299 1","pages":"14:1-14:12"},"PeriodicalIF":0.0,"publicationDate":"2014-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75434661","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Matching dominance: capture the semantics of dominance for multi-dimensional uncertain objects
Ying Zhang, W. Zhang, Xuemin Lin, M. A. Cheema, Chengqi Zhang
DOI: 10.1145/2618243.2618246 · Pages: 18:1-18:12 · Published: 2014-06-30

Abstract: The dominance operator plays an important role in a wide spectrum of multi-criteria decision making applications. Generally speaking, a dominance operator is a partial order on a set O of objects, and we say the dominance operator has the monotonic property with regard to a family of ranking functions F if o1 dominates o2 implies f(o1) ≥ f(o2) for any ranking function f ∈ F and objects o1, o2 ∈ O. The dominance operator on multi-dimensional points is well defined and has the monotonic property with regard to any monotonic ranking (scoring) function. Due to the uncertain nature of data in many emerging applications, a variety of existing works have studied the semantics of ranking queries on uncertain objects. However, the problem of a dominance operator for multi-dimensional uncertain objects remains open. Although there have been several attempts to propose dominance operators on multi-dimensional uncertain objects, none of them guarantees the monotonic property with regard to these ranking approaches.

Motivated by this, in this paper we propose a novel matching-based dominance operator, namely matching dominance, to capture the semantics of dominance for multi-dimensional uncertain objects, so that the new dominance operator has the monotonic property with regard to the monotonic parameterized ranking function, which can unify other popular ranking approaches for uncertain objects. We then develop a layer indexing technique, the Matching Dominance based Band (MDB), to facilitate top-k queries on multi-dimensional uncertain objects based on the matching dominance operator proposed in this paper. Efficient algorithms are proposed to compute the MDB index. Comprehensive experiments convincingly demonstrate the effectiveness and efficiency of our indexing techniques.

{"title":"Efficient data management and statistics with zero-copy integration","authors":"Jonathan Lajus, H. Mühleisen","doi":"10.1145/2618243.2618265","DOIUrl":"https://doi.org/10.1145/2618243.2618265","url":null,"abstract":"Statistical analysts have long been struggling with evergrowing data volumes. While specialized data management systems such as relational databases would be able to handle the data, statistical analysis tools are far more convenient to express complex data analyses. An integration of these two classes of systems has the potential to overcome the data management issue while at the same time keeping analysis convenient. However, one must keep a careful eye on implementation overheads such as serialization. In this paper, we propose the in-process integration of data management and analytical tools. Furthermore, we argue that a zero-copy integration is feasible due to the omnipresence of C-style arrays containing native types. We discuss the general concept and present a prototype of this integration based on the columnar relational database MonetDB and the R environment for statistical computing. We evaluate the performance of this prototype in a series of micro-benchmarks of common data management tasks.","PeriodicalId":74773,"journal":{"name":"Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management","volume":"27 1","pages":"12:1-12:10"},"PeriodicalIF":0.0,"publicationDate":"2014-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73999681","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Geometric graph matching and similarity: a probabilistic approach","authors":"Ayser Armiti, Michael Gertz","doi":"10.1145/2618243.2618259","DOIUrl":"https://doi.org/10.1145/2618243.2618259","url":null,"abstract":"Finding common structures is vital for many graph-based applications, such as road network analysis, pattern recognition, or drug discovery. Such a task is formalized as the inexact graph matching problem, which is known to be NP-hard. Several graph matching algorithms have been proposed to find approximate solutions. However, such algorithms still face many problems in terms of memory consumption, runtime, and tolerance to changes in graph structure or labels.\u0000 In this paper, we propose a solution to the inexact graph matching problem for geometric graphs in 2D space. Geometric graphs provide a suitable modeling framework for applications like the above, where vertices are located in some 2D space. The main idea of our approach is to formalize the graph matching problem in a maximum likelihood estimation framework. Then, the expectation maximization technique is used to estimate the match between two graphs. We propose a novel density function that estimates the similarity between the vertices of different graphs. It is computed based on both 1) the spatial properties of a vertex and its direct neighbors, and 2) the shortest paths that connect a vertex to other vertices in a graph. To guarantee scalability, we propose to compute the density function based on the properties of sub-structures of the graph. Using representative geometric graphs from several application domains, we show that our approach outperforms existing graph matching algorithms in terms of matching quality, runtime, and memory consumption.","PeriodicalId":74773,"journal":{"name":"Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management","volume":"55 1","pages":"27:1-27:12"},"PeriodicalIF":0.0,"publicationDate":"2014-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83777622","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Node classification in uncertain graphs","authors":"Michele Dallachiesa, C. Aggarwal, Themis Palpanas","doi":"10.1145/2618243.2618277","DOIUrl":"https://doi.org/10.1145/2618243.2618277","url":null,"abstract":"In many real applications that use and analyze networked data, the links in the network graph may be erroneous, or derived from probabilistic techniques. In such cases, the node classification problem can be challenging, since the unreliability of the links may affect the final results of the classification process. In this paper, we focus on situations that require the analysis of the uncertainty that is present in the graph structure. We study the novel problem of node classification in uncertain graphs, by treating uncertainty as a first-class citizen. We propose two techniques based on a Bayes model, and show the benefits of incorporating uncertainty in the classification process as a first-class citizen. The experimental results demonstrate the effectiveness of our approaches.","PeriodicalId":74773,"journal":{"name":"Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management","volume":"82 1","pages":"32:1-32:4"},"PeriodicalIF":0.0,"publicationDate":"2014-05-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85597296","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Tuning large scale deduplication with reduced effort
Guilherme Dal Bianco, R. Galante, C. Heuser, Marcos André Gonçalves
DOI: 10.1145/2484838.2484873 · Pages: 18:1-18:12 · Published: 2013-07-29

Abstract: Deduplication is the task of identifying which objects in a data repository are potentially the same. It usually demands user intervention in several steps of the process, mainly to identify pairs representing matches and non-matches. This information is then used to help identify other potentially duplicated records. When deduplication is applied to very large datasets, performance and matching quality depend on expert users configuring the most important steps of the process (e.g., blocking and classification). In this paper, we propose a new framework, called FS-Dedup, that helps tune the deduplication process on large datasets with reduced effort from the user, who is only required to label a small, automatically selected subset of pairs. FS-Dedup exploits Signature-Based Deduplication (Sig-Dedup) algorithms in its deduplication core. Sig-Dedup is characterized by high efficiency and scalability on large datasets but requires an expert user to tune several parameters. FS-Dedup addresses this drawback by providing a framework that demands neither specialized user knowledge about the dataset nor hand-picked thresholds to produce high effectiveness. Our evaluation on large real and synthetic datasets (containing millions of records) shows that FS-Dedup is able to reach or even surpass the maximal matching quality obtained by Sig-Dedup techniques with reduced manual effort from the user.

Learning to explore scientific workflow repositories
Julia Stoyanovich, Paramveer S. Dhillon, S. Davidson, Brian Lyons
DOI: 10.1145/2484838.2484848 · Pages: 31:1-31:4 · Published: 2013-07-29

Abstract: Scientific workflows are gaining popularity, and repositories of workflows are starting to emerge. In this paper we describe TopicsExplorer, a data exploration approach for myExperiment.org, a collaborative platform for the exchange of scientific workflows and experimental plans. Our approach uses a variant of topic modeling with tags as features and generates a browsable view of the repository. TopicsExplorer has been fully integrated into the open-source platform of myExperiment.org and is available to users at www.myexperiment.org/topics. We also present our recently developed personalization component, which customizes topics based on user feedback. Finally, we discuss our ongoing performance optimization efforts that make computing and managing personalized topic views of the myExperiment.org repository feasible.
