{"title":"Multi-transfer: Transfer learning with multiple views and multiple sources","authors":"Ben Tan, Erheng Zhong, E. Xiang, Qiang Yang","doi":"10.1137/1.9781611972832.27","DOIUrl":"https://doi.org/10.1137/1.9781611972832.27","url":null,"abstract":"Transfer learning, which aims to help the learning task in a target domain by leveraging knowledge from auxiliary domains, has been demonstrated to be effective in different applications, e.g., text mining, sentiment analysis, etc. In addition, in many real-world applications, auxiliary data are described from multiple perspectives and usually carried by multiple sources. For example, to help classify videos on Youtube, which include three views/perspectives: image, voice and subtitles, one may borrow data from Flickr, Last.FM and Google News. Although any single instance in these domains can only cover a part of the views available on Youtube, actually the piece of information carried by them may compensate with each other. In this paper, we define this transfer learning problem as Transfer Learning with Multiple Views and Multiple Sources. As different sources may have different probability distributions and different views may be compensate or inconsistent with each other, merging all data in a simplistic manner will not give optimal result. Thus, we propose a novel algorithm to leverage knowledge from different views and sources collaboratively, by letting different views from different sources complement each other through a co-training style framework, while revise the distribution differences in different domains. We conduct empirical studies on several real-world datasets to show that the proposed approach can improve the classification accuracy by up to 8% against different state-of-the-art baselines.","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":"34 1","pages":"282-293"},"PeriodicalIF":1.3,"publicationDate":"2014-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83621622","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Weighted Random Forests Approach to Improve Predictive Performance.","authors":"Stacey J Winham, Robert R Freimuth, Joanna M Biernacka","doi":"10.1002/sam.11196","DOIUrl":"https://doi.org/10.1002/sam.11196","url":null,"abstract":"<p><p>Identifying genetic variants associated with complex disease in high-dimensional data is a challenging problem, and complicated etiologies such as gene-gene interactions are often ignored in analyses. The data-mining method Random Forests (RF) can handle high-dimensions; however, in high-dimensional data, RF is not an effective filter for identifying risk factors associated with the disease trait via complex genetic models such as gene-gene interactions without strong marginal components. Here we propose an extension called Weighted Random Forests (wRF), which incorporates tree-level weights to emphasize more accurate trees in prediction and calculation of variable importance. We demonstrate through simulation and application to data from a genetic study of addiction that wRF can outperform RF in high-dimensional data, although the improvements are modest and limited to situations with effect sizes that are larger than what is realistic in genetics of complex disease. Thus, the current implementation of wRF is unlikely to improve detection of relevant predictors in high-dimensional genetic data, but may be applicable in other situations where larger effect sizes are anticipated.</p>","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":"6 6","pages":"496-505"},"PeriodicalIF":1.3,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1002/sam.11196","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"32096214","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Penalized Regression and Risk Prediction in Genome-Wide Association Studies.","authors":"Erin Austin, Wei Pan, Xiaotong Shen","doi":"10.1002/sam.11183","DOIUrl":"10.1002/sam.11183","url":null,"abstract":"<p><p>An important task in personalized medicine is to predict disease risk based on a person's genome, e.g. on a large number of single-nucleotide polymorphisms (SNPs). Genome-wide association studies (GWAS) make SNP and phenotype data available to researchers. A critical question for researchers is how to best predict disease risk. Penalized regression equipped with variable selection, such as LASSO and SCAD, is deemed to be promising in this setting. However, the sparsity assumption taken by the LASSO, SCAD and many other penalized regression techniques may not be applicable here: it is now hypothesized that many common diseases are associated with many SNPs with small to moderate effects. In this article, we use the GWAS data from the Wellcome Trust Case Control Consortium (WTCCC) to investigate the performance of various unpenalized and penalized regression approaches under true sparse or non-sparse models. We find that in general penalized regression outperformed unpenalized regression; SCAD, TLP and LASSO performed best for sparse models, while elastic net regression was the winner, followed by ridge, TLP and LASSO, for non-sparse models.</p>","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":"6 4","pages":""},"PeriodicalIF":1.3,"publicationDate":"2013-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3859439/pdf/nihms534715.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"31963889","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Regularized Partial Least Squares with an Application to NMR Spectroscopy.","authors":"Genevera I Allen, Christine Peterson, Marina Vannucci, Mirjana Maletić-Savatić","doi":"10.1002/sam.11169","DOIUrl":"https://doi.org/10.1002/sam.11169","url":null,"abstract":"<p><p>High-dimensional data common in genomics, proteomics, and chemometrics often contains complicated correlation structures. Recently, partial least squares (PLS) and Sparse PLS methods have gained attention in these areas as dimension reduction techniques in the context of supervised data analysis. We introduce a framework for Regularized PLS by solving a relaxation of the SIMPLS optimization problem with penalties on the PLS loadings vectors. Our approach enjoys many advantages including flexibility, general penalties, easy interpretation of results, and fast computation in high-dimensional settings. We also outline extensions of our methods leading to novel methods for non-negative PLS and generalized PLS, an adoption of PLS for structured data. We demonstrate the utility of our methods through simulations and a case study on proton Nuclear Magnetic Resonance (NMR) spectroscopy data.</p>","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":"6 4","pages":"302-314"},"PeriodicalIF":1.3,"publicationDate":"2013-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1002/sam.11169","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"32104846","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"AAPL: Assessing Association between P-value Lists.","authors":"Tianwei Yu, Yize Zhao, Shihao Shen","doi":"10.1002/sam.11180","DOIUrl":"https://doi.org/10.1002/sam.11180","url":null,"abstract":"<p><p>Joint analyses of high-throughput datasets generate the need to assess the association between two long lists of p-values. In such p-value lists, the vast majority of the features are insignificant. Ideally contributions of features that are null in both tests should be minimized. However, by random chance their p-values are uniformly distributed between zero and one, and weak correlations of the p-values may exist due to inherent biases in the high-throughput technology used to generate the multiple datasets. Rank-based agreement test may capture such unwanted effects. Testing contingency tables generated using hard cutoffs may be sensitive to arbitrary threshold choice. We develop a novel method based on feature-level concordance using local false discovery rate. The association score enjoys straight-forward interpretation. The method shows higher statistical power to detect association between p-value lists in simulation. We demonstrate its utility using real data analysis. The R implementation of the method is available at http://userwww.service.emory.edu/~tyu8/AAPL/.</p>","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":"6 2","pages":"144-155"},"PeriodicalIF":1.3,"publicationDate":"2013-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1002/sam.11180","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"31392998","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Predicting Simulation Parameters of Biological Systems Using a Gaussian Process Model.","authors":"Xiangxin Zhu, Max Welling, Fang Jin, John Lowengrub","doi":"10.1002/sam.11163","DOIUrl":"https://doi.org/10.1002/sam.11163","url":null,"abstract":"<p><p>Finding optimal parameters for simulating biological systems is usually a very difficult and expensive task in systems biology. Brute force searching is infeasible in practice because of the huge (often infinite) search space. In this article, we propose predicting the parameters efficiently by learning the relationship between system outputs and parameters using regression. However, the conventional parametric regression models suffer from two issues, thus are not applicable to this problem. First, restricting the regression function as a certain fixed type (e.g. linear, polynomial, etc.) introduces too strong assumptions that reduce the model flexibility. Second, conventional regression models fail to take into account the fact that a fixed parameter value may correspond to multiple different outputs due to the stochastic nature of most biological simulations, and the existence of a potentially large number of other factors that affect the simulation outputs. We propose a novel approach based on a Gaussian process model that addresses the two issues jointly. We apply our approach to a tumor vessel growth model and the feedback Wright-Fisher model. The experimental results show that our method can predict the parameter values of both of the two models with high accuracy.</p>","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":"5 6","pages":"509-522"},"PeriodicalIF":1.3,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1002/sam.11163","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"31300011","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Maximum Likelihood Estimation Over Directed Acyclic Gaussian Graphs.","authors":"Yiping Yuan, Xiaotong Shen, Wei Pan","doi":"10.1002/sam.11168","DOIUrl":"10.1002/sam.11168","url":null,"abstract":"<p><p>Estimation of multiple directed graphs becomes challenging in the presence of inhomogeneous data, where directed acyclic graphs (DAGs) are used to represent causal relations among random variables. To infer causal relations among variables, we estimate multiple DAGs given a known ordering in Gaussian graphical models. In particular, we propose a constrained maximum likelihood method with nonconvex constraints over elements and element-wise differences of adjacency matrices, for identifying the sparseness structure as well as detecting structural changes over adjacency matrices of the graphs. Computationally, we develop an efficient algorithm based on augmented Lagrange multipliers, the difference convex method, and a novel fast algorithm for solving convex relaxation subproblems. Numerical results suggest that the proposed method performs well against its alternatives for simulated and real data.</p>","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":"5 6","pages":""},"PeriodicalIF":1.3,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3866136/pdf/nihms461070.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"31973834","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multiple Response Regression for Gaussian Mixture Models with Known Labels.","authors":"Wonyul Lee, Ying Du, Wei Sun, D Neil Hayes, Yufeng Liu","doi":"10.1002/sam.11158","DOIUrl":"10.1002/sam.11158","url":null,"abstract":"<p><p>Multiple response regression is a useful regression technique to model multiple response variables using the same set of predictor variables. Most existing methods for multiple response regression are designed for modeling homogeneous data. In many applications, however, one may have heterogeneous data where the samples are divided into multiple groups. Our motivating example is a cancer dataset where the samples belong to multiple cancer subtypes. In this paper, we consider modeling the data coming from a mixture of several Gaussian distributions with known group labels. A naive approach is to split the data into several groups according to the labels and model each group separately. Although it is simple, this approach ignores potential common structures across different groups. We propose new penalized methods to model all groups jointly in which the common and unique structures can be identified. The proposed methods estimate the regression coefficient matrix, as well as the conditional inverse covariance matrix of response variables. Asymptotic properties of the proposed methods are explored. Through numerical examples, we demonstrate that both estimation and prediction can be improved by modeling all groups jointly using the proposed methods. An application to a glioblastoma cancer dataset reveals some interesting common and unique gene relationships across different cancer subtypes.</p>","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":"5 6","pages":""},"PeriodicalIF":1.3,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3885347/pdf/nihms539872.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"32023141","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Nonlinear Vertex Discriminant Analysis with Reproducing Kernels.","authors":"Tong Tong Wu, Yichao Wu","doi":"10.1002/sam.11137","DOIUrl":"10.1002/sam.11137","url":null,"abstract":"<p><p>The novel supervised learning method of vertex discriminant analysis (VDA) has been demonstrated for its good performance in multicategory classification. The current paper explores an elaboration of VDA for nonlinear discrimination. By incorporating reproducing kernels, VDA can be generalized from linear discrimination to nonlinear discrimination. Our numerical experiments show that the new reproducing kernel-based method leads to accurate classification for both linear and nonlinear cases.</p>","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":"5 2","pages":"167-176"},"PeriodicalIF":1.3,"publicationDate":"2012-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3510707/pdf/nihms419106.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"31092124","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}