Statistical Analysis and Data Mining: Latest Articles
Mining Compressing Sequential Patterns
IF 1.3, Zone 4, Mathematics
Statistical Analysis and Data Mining Pub Date : 2014-02-01 DOI: 10.1002/sam.11192
Hoang Thanh Lam, F. Mörchen, Dmitriy Fradkin, T. Calders
Abstract: Pattern mining based on data compression has been successfully applied in many data mining tasks. For itemset data, the Krimp algorithm, based on the minimum description length (MDL) principle, was shown to be very effective in solving the redundancy issue in descriptive pattern mining. However, for sequence data, the redundancy issue of the set of frequent sequential patterns is not fully addressed in the literature. In this article, we study MDL-based algorithms for mining non-redundant sets of sequential patterns from a sequence database. First, we propose an encoding scheme for compressing sequence data with sequential patterns. Second, we formulate the problem of mining the most compressing sequential patterns from a sequence database. We show that this problem is intractable and belongs to the class of inapproximable problems. Therefore, we propose two heuristic algorithms. The first of these uses a two-phase approach similar to Krimp for itemset data. To overcome performance issues in candidate generation, we also propose GoKrimp, an algorithm that directly mines compressing patterns by greedily extending a pattern until adding the extension to the dictionary yields no additional compression benefit. Since checks for the additional compression benefit of an extension are computationally expensive, we propose a dependency test which only chooses related events for extending a given pattern. This technique improves the efficiency of the GoKrimp algorithm significantly while still preserving the quality of the set of patterns. We conduct an empirical study on eight datasets to show the effectiveness of our approach in comparison to the state-of-the-art algorithms in terms of interpretability of the extracted patterns, run time, compression ratio, and classification accuracy using the discovered patterns as features for different classifiers. © 2013 Wiley Periodicals, Inc. Statistical Analysis and Data Mining, 2013
Citations: 29
Multi-transfer: Transfer learning with multiple views and multiple sources
IF 1.3, Zone 4, Mathematics
Statistical Analysis and Data Mining Pub Date : 2014-01-01 DOI: 10.1137/1.9781611972832.27
Ben Tan, Erheng Zhong, E. Xiang, Qiang Yang
Abstract: Transfer learning, which aims to help the learning task in a target domain by leveraging knowledge from auxiliary domains, has been demonstrated to be effective in different applications, e.g., text mining and sentiment analysis. In addition, in many real-world applications, auxiliary data are described from multiple perspectives and usually carried by multiple sources. For example, to help classify videos on YouTube, which include three views/perspectives (image, voice and subtitles), one may borrow data from Flickr, Last.FM and Google News. Although any single instance in these domains can only cover a part of the views available on YouTube, the pieces of information they carry may complement each other. In this paper, we define this transfer learning problem as Transfer Learning with Multiple Views and Multiple Sources. As different sources may have different probability distributions and different views may be complementary to or inconsistent with each other, merging all data in a simplistic manner will not give optimal results. Thus, we propose a novel algorithm to leverage knowledge from different views and sources collaboratively, by letting different views from different sources complement each other through a co-training style framework, while revising the distribution differences in different domains. We conduct empirical studies on several real-world datasets to show that the proposed approach can improve the classification accuracy by up to 8% against different state-of-the-art baselines.
Citations: 46
A Weighted Random Forests Approach to Improve Predictive Performance.
IF 1.3, Zone 4, Mathematics
Statistical Analysis and Data Mining Pub Date : 2013-12-01 DOI: 10.1002/sam.11196
Stacey J Winham, Robert R Freimuth, Joanna M Biernacka
Abstract: Identifying genetic variants associated with complex disease in high-dimensional data is a challenging problem, and complicated etiologies such as gene-gene interactions are often ignored in analyses. The data-mining method Random Forests (RF) can handle high dimensions; however, in high-dimensional data, RF is not an effective filter for identifying risk factors associated with the disease trait via complex genetic models such as gene-gene interactions without strong marginal components. Here we propose an extension called Weighted Random Forests (wRF), which incorporates tree-level weights to emphasize more accurate trees in prediction and calculation of variable importance. We demonstrate through simulation and application to data from a genetic study of addiction that wRF can outperform RF in high-dimensional data, although the improvements are modest and limited to situations with effect sizes that are larger than what is realistic in genetics of complex disease. Thus, the current implementation of wRF is unlikely to improve detection of relevant predictors in high-dimensional genetic data, but may be applicable in other situations where larger effect sizes are anticipated.
Citations: 81
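The idea of tree-level weights in the abstract above can be illustrated with a small voting sketch. This is not the authors' wRF implementation: the function name `weighted_vote` and the choice of weights are assumptions made for illustration only, and the abstract does not say how the weights are computed (some per-tree accuracy estimate is assumed here).

```python
from collections import defaultdict


def weighted_vote(tree_predictions, tree_weights):
    """Return the class label with the largest total weight across trees.

    tree_predictions: one predicted label per tree for a single sample.
    tree_weights: one non-negative weight per tree (hypothetical; assumed
    to reflect some per-tree accuracy estimate).
    """
    if len(tree_predictions) != len(tree_weights):
        raise ValueError("need exactly one weight per tree")
    totals = defaultdict(float)
    for label, weight in zip(tree_predictions, tree_weights):
        totals[label] += weight
    return max(totals, key=totals.get)


# Two weak trees vote "control"; one more accurate tree votes "case".
predictions = ["control", "control", "case"]
print(weighted_vote(predictions, [1.0, 1.0, 1.0]))  # equal weights: plain majority vote
print(weighted_vote(predictions, [0.2, 0.2, 0.9]))  # accuracy-style weights
```

With equal weights the function reduces to the ordinary majority vote of a standard random forest; unequal weights let a few accurate trees outvote many inaccurate ones.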
Penalized Regression and Risk Prediction in Genome-Wide Association Studies.
IF 1.3, Zone 4, Mathematics
Statistical Analysis and Data Mining Pub Date : 2013-08-01 DOI: 10.1002/sam.11183
Erin Austin, Wei Pan, Xiaotong Shen
Abstract: An important task in personalized medicine is to predict disease risk based on a person's genome, e.g. on a large number of single-nucleotide polymorphisms (SNPs). Genome-wide association studies (GWAS) make SNP and phenotype data available to researchers. A critical question for researchers is how to best predict disease risk. Penalized regression equipped with variable selection, such as LASSO and SCAD, is deemed to be promising in this setting. However, the sparsity assumption taken by the LASSO, SCAD and many other penalized regression techniques may not be applicable here: it is now hypothesized that many common diseases are associated with many SNPs with small to moderate effects. In this article, we use the GWAS data from the Wellcome Trust Case Control Consortium (WTCCC) to investigate the performance of various unpenalized and penalized regression approaches under true sparse or non-sparse models. We find that in general penalized regression outperformed unpenalized regression; SCAD, TLP and LASSO performed best for sparse models, while elastic net regression was the winner, followed by ridge, TLP and LASSO, for non-sparse models.
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3859439/pdf/nihms534715.pdf
Citations: 0
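The contrast between sparse and non-sparse penalties mentioned in the abstract above can be seen in the textbook special case of an orthonormal design, where each penalized coefficient is a simple function of the ordinary least-squares coefficient. The sketch below is not taken from the article: the function names, the example coefficients, and the penalty scaling (objective 1/2 * residual sum of squares plus lam * |b| for LASSO, plus lam/2 * b^2 for ridge) are assumptions chosen for illustration.

```python
def lasso_shrink(z, lam):
    """LASSO solution for one coefficient under an orthonormal design:
    soft-thresholding of the least-squares estimate z."""
    if abs(z) <= lam:
        return 0.0  # small effects are set exactly to zero
    return z - lam if z > 0 else z + lam


def ridge_shrink(z, lam):
    """Ridge solution under the same assumptions: proportional shrinkage
    that never produces an exact zero for a nonzero z."""
    return z / (1.0 + lam)


ols = [3.0, -0.4, 0.1, -2.0]  # hypothetical least-squares coefficients
lam = 0.5
lasso = [lasso_shrink(z, lam) for z in ols]
ridge = [ridge_shrink(z, lam) for z in ols]
print(lasso)  # [2.5, 0.0, 0.0, -1.5]
print(ridge)
```

LASSO zeroes out the two small coefficients, which suits a sparse truth; ridge keeps every coefficient but shrinks it, which is consistent with the abstract's point that many small-to-moderate effects may favor non-sparse methods.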
Regularized Partial Least Squares with an Application to NMR Spectroscopy.
IF 1.3, Zone 4, Mathematics
Statistical Analysis and Data Mining Pub Date : 2013-08-01 DOI: 10.1002/sam.11169
Genevera I Allen, Christine Peterson, Marina Vannucci, Mirjana Maletić-Savatić
Abstract: High-dimensional data common in genomics, proteomics, and chemometrics often contain complicated correlation structures. Recently, partial least squares (PLS) and Sparse PLS methods have gained attention in these areas as dimension reduction techniques in the context of supervised data analysis. We introduce a framework for Regularized PLS by solving a relaxation of the SIMPLS optimization problem with penalties on the PLS loadings vectors. Our approach enjoys many advantages including flexibility, general penalties, easy interpretation of results, and fast computation in high-dimensional settings. We also outline extensions of our methods leading to novel methods for non-negative PLS and generalized PLS, an adaptation of PLS for structured data. We demonstrate the utility of our methods through simulations and a case study on proton Nuclear Magnetic Resonance (NMR) spectroscopy data.
Citations: 45
AAPL: Assessing Association between P-value Lists.
IF 1.3, Zone 4, Mathematics
Statistical Analysis and Data Mining Pub Date : 2013-04-01 DOI: 10.1002/sam.11180
Tianwei Yu, Yize Zhao, Shihao Shen
Abstract: Joint analyses of high-throughput datasets generate the need to assess the association between two long lists of p-values. In such p-value lists, the vast majority of the features are insignificant. Ideally, contributions of features that are null in both tests should be minimized. However, by random chance their p-values are uniformly distributed between zero and one, and weak correlations of the p-values may exist due to inherent biases in the high-throughput technology used to generate the multiple datasets. Rank-based agreement tests may capture such unwanted effects. Testing contingency tables generated using hard cutoffs may be sensitive to arbitrary threshold choice. We develop a novel method based on feature-level concordance using local false discovery rate. The association score enjoys straightforward interpretation. The method shows higher statistical power to detect association between p-value lists in simulation. We demonstrate its utility using real data analysis. The R implementation of the method is available at http://userwww.service.emory.edu/~tyu8/AAPL/.
Citations: 1
Predicting Simulation Parameters of Biological Systems Using a Gaussian Process Model.
IF 1.3, Zone 4, Mathematics
Statistical Analysis and Data Mining Pub Date : 2012-12-01 DOI: 10.1002/sam.11163
Xiangxin Zhu, Max Welling, Fang Jin, John Lowengrub
Abstract: Finding optimal parameters for simulating biological systems is usually a very difficult and expensive task in systems biology. Brute force searching is infeasible in practice because of the huge (often infinite) search space. In this article, we propose predicting the parameters efficiently by learning the relationship between system outputs and parameters using regression. However, the conventional parametric regression models suffer from two issues, and thus are not applicable to this problem. First, restricting the regression function to a certain fixed type (e.g. linear or polynomial) introduces overly strong assumptions that reduce the model flexibility. Second, conventional regression models fail to take into account the fact that a fixed parameter value may correspond to multiple different outputs due to the stochastic nature of most biological simulations, and the existence of a potentially large number of other factors that affect the simulation outputs. We propose a novel approach based on a Gaussian process model that addresses the two issues jointly. We apply our approach to a tumor vessel growth model and the feedback Wright-Fisher model. The experimental results show that our method can predict the parameter values of both models with high accuracy.
Citations: 9
Maximum Likelihood Estimation Over Directed Acyclic Gaussian Graphs.
IF 1.3, Zone 4, Mathematics
Statistical Analysis and Data Mining Pub Date : 2012-12-01 DOI: 10.1002/sam.11168
Yiping Yuan, Xiaotong Shen, Wei Pan
Abstract: Estimation of multiple directed graphs becomes challenging in the presence of inhomogeneous data, where directed acyclic graphs (DAGs) are used to represent causal relations among random variables. To infer causal relations among variables, we estimate multiple DAGs given a known ordering in Gaussian graphical models. In particular, we propose a constrained maximum likelihood method with nonconvex constraints over elements and element-wise differences of adjacency matrices, for identifying the sparseness structure as well as detecting structural changes over adjacency matrices of the graphs. Computationally, we develop an efficient algorithm based on augmented Lagrange multipliers, the difference convex method, and a novel fast algorithm for solving convex relaxation subproblems. Numerical results suggest that the proposed method performs well against its alternatives for simulated and real data.
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3866136/pdf/nihms461070.pdf
Citations: 0
Multiple Response Regression for Gaussian Mixture Models with Known Labels.
IF 1.3, Zone 4, Mathematics
Statistical Analysis and Data Mining Pub Date : 2012-12-01 DOI: 10.1002/sam.11158
Wonyul Lee, Ying Du, Wei Sun, D Neil Hayes, Yufeng Liu
Abstract: Multiple response regression is a useful regression technique to model multiple response variables using the same set of predictor variables. Most existing methods for multiple response regression are designed for modeling homogeneous data. In many applications, however, one may have heterogeneous data where the samples are divided into multiple groups. Our motivating example is a cancer dataset where the samples belong to multiple cancer subtypes. In this paper, we consider modeling the data coming from a mixture of several Gaussian distributions with known group labels. A naive approach is to split the data into several groups according to the labels and model each group separately. Although it is simple, this approach ignores potential common structures across different groups. We propose new penalized methods to model all groups jointly in which the common and unique structures can be identified. The proposed methods estimate the regression coefficient matrix, as well as the conditional inverse covariance matrix of response variables. Asymptotic properties of the proposed methods are explored. Through numerical examples, we demonstrate that both estimation and prediction can be improved by modeling all groups jointly using the proposed methods. An application to a glioblastoma cancer dataset reveals some interesting common and unique gene relationships across different cancer subtypes.
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3885347/pdf/nihms539872.pdf
Citations: 0
Nonlinear Vertex Discriminant Analysis with Reproducing Kernels.
IF 1.3, Zone 4, Mathematics
Statistical Analysis and Data Mining Pub Date : 2012-04-01 DOI: 10.1002/sam.11137
Tong Tong Wu, Yichao Wu
Abstract: The novel supervised learning method of vertex discriminant analysis (VDA) has demonstrated good performance in multicategory classification. The current paper explores an elaboration of VDA for nonlinear discrimination. By incorporating reproducing kernels, VDA can be generalized from linear discrimination to nonlinear discrimination. Our numerical experiments show that the new reproducing kernel-based method leads to accurate classification for both linear and nonlinear cases.
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3510707/pdf/nihms419106.pdf
Citations: 0
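As a rough illustration of what "incorporating reproducing kernels" involves, the sketch below builds a Gram matrix with a Gaussian (RBF) kernel, which kernel methods use in place of plain inner products between samples. The abstract does not say which kernel the authors use or how it enters the VDA objective, so the kernel choice, the parameter `gamma`, and the function names are assumptions for illustration only.

```python
import math


def gaussian_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def gram_matrix(points, gamma=1.0):
    """Pairwise kernel values between all points (symmetric, ones on the diagonal)."""
    return [[gaussian_kernel(p, q, gamma) for q in points] for p in points]


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]  # hypothetical feature vectors
for row in gram_matrix(points):
    print([round(value, 4) for value in row])
```

A classifier that depends on the data only through such a matrix can represent nonlinear decision boundaries in the original feature space while remaining linear in the kernel-induced space.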