Statistical Analysis and Data Mining — Latest Articles

Nonlinear variable selection with continuous outcome: a fully nonparametric incremental forward stagewise approach.
IF 1.3 | Mathematics (CAS Tier 4)
Statistical Analysis and Data Mining Pub Date : 2018-08-01 Epub Date: 2018-06-19 DOI: 10.1002/sam.11381
Tianwei Yu
{"title":"Nonlinear variable selection with continuous outcome: a fully nonparametric incremental forward stagewise approach.","authors":"Tianwei Yu","doi":"10.1002/sam.11381","DOIUrl":"https://doi.org/10.1002/sam.11381","url":null,"abstract":"<p><p>We present a method of variable selection for the sparse generalized additive model. The method doesn't assume any specific functional form, and can select from a large number of candidates. It takes the form of incremental forward stagewise regression. Given no functional form is assumed, we devised an approach termed \"roughening\" to adjust the residuals in the iterations. In simulations, we show the new method is competitive against popular machine learning approaches. We also demonstrate its performance using some real datasets. The method is available as a part of the nlnet package on CRAN (https://cran.r-project.org/package=nlnet).</p>","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":null,"pages":null},"PeriodicalIF":1.3,"publicationDate":"2018-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1002/sam.11381","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"36866356","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
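The entry above describes a fully nonparametric incremental forward stagewise procedure. As a hedged illustration only (not the paper's "roughening" scheme, which lives in the nlnet R package), the sketch below implements a generic forward stagewise loop in Python: at each step it fits a shallow regression tree to the residuals for every candidate predictor separately, keeps the best one, and takes a small step toward that fit, so variables enter incrementally. All names and parameters here are mine.

```python
# Hedged sketch: generic nonparametric incremental forward stagewise regression.
# Not the nlnet "roughening" algorithm; shallow trees stand in for the
# unspecified univariate nonparametric fits.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def forward_stagewise(X, y, n_steps=200, step=0.1, max_depth=2):
    n, p = X.shape
    residual = y.astype(float).copy()
    fit = np.zeros(n)
    selected = set()
    for _ in range(n_steps):
        best_j, best_pred, best_sse = None, None, np.inf
        for j in range(p):
            tree = DecisionTreeRegressor(max_depth=max_depth)
            tree.fit(X[:, [j]], residual)
            pred = tree.predict(X[:, [j]])
            sse = np.sum((residual - pred) ** 2)
            if sse < best_sse:
                best_j, best_pred, best_sse = j, pred, sse
        fit += step * best_pred          # small step toward the best univariate fit
        residual -= step * best_pred
        selected.add(best_j)
    return fit, sorted(selected)

# Toy usage: y depends nonlinearly on the first two of ten predictors.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + 0.3 * rng.normal(size=300)
fit, selected = forward_stagewise(X, y)
print("selected predictors:", selected)
```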
The next-generation K-means algorithm.
IF 1.3 | Mathematics (CAS Tier 4)
Statistical Analysis and Data Mining Pub Date : 2018-08-01 Epub Date: 2018-05-11 DOI: 10.1002/sam.11379
Eugene Demidenko
{"title":"The next-generation K-means algorithm.","authors":"Eugene Demidenko","doi":"10.1002/sam.11379","DOIUrl":"10.1002/sam.11379","url":null,"abstract":"<p><p>Typically, when referring to a model-based classification, the mixture distribution approach is understood. In contrast, we revive the hard-classification model-based approach developed by Banfield and Raftery (1993) for which K-means is equivalent to the maximum likelihood (ML) estimation. The next-generation K-means algorithm does not end after the classification is achieved, but moves forward to answer the following fundamental questions: Are there clusters, how many clusters are there, what are the statistical properties of the estimated means and index sets, what is the distribution of the coefficients in the clusterwise regression, and how to classify multilevel data? The statistical model-based approach for the K-means algorithm is the key, because it allows statistical simulations and studying the properties of classification following the track of the classical statistics. This paper illustrates the application of the ML classification to testing the no-clusters hypothesis, to studying various methods for selection of the number of clusters using simulations, robust clustering using Laplace distribution, studying properties of the coefficients in clusterwise regression, and finally to multilevel data by marrying the variance components model with K-means.</p>","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":null,"pages":null},"PeriodicalIF":1.3,"publicationDate":"2018-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6062903/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"36368001","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
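The abstract above treats K-means as maximum likelihood estimation under a hard-classification model, which lets questions such as "how many clusters?" be posed statistically. Below is a minimal sketch of that idea using a crude BIC-style criterion of my own under a spherical, equal-variance Gaussian assumption; it is not the tests or simulation studies developed in the paper.

```python
# Hedged sketch: choose the number of clusters for K-means with a crude
# BIC-style criterion under a spherical, equal-variance Gaussian model.
# Illustrates the model-based view only, not the paper's methods.
import numpy as np
from sklearn.cluster import KMeans

def kmeans_bic(X, k):
    n, d = X.shape
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wss = km.inertia_                      # within-cluster sum of squares
    sigma2 = wss / (n * d)                 # pooled spherical variance estimate
    log_lik = -0.5 * n * d * (np.log(2 * np.pi * sigma2) + 1)
    n_params = k * d + 1                   # cluster means plus a common variance
    return -2 * log_lik + n_params * np.log(n)

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, size=(100, 2)) for c in (0, 4, 8)])
scores = {k: kmeans_bic(X, k) for k in range(1, 7)}
print("BIC by k:", scores)
print("chosen k:", min(scores, key=scores.get))
```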
Whole-Volume Clustering of Time Series Data from Zebrafish Brain Calcium Images via Mixture Modeling.
IF 1.3 | Mathematics (CAS Tier 4)
Statistical Analysis and Data Mining Pub Date : 2018-02-01 Epub Date: 2017-12-06 DOI: 10.1002/sam.11366
Hien D Nguyen, Jeremy F P Ullmann, Geoffrey J McLachlan, Venkatakaushik Voleti, Wenze Li, Elizabeth M C Hillman, David C Reutens, Andrew L Janke
{"title":"Whole-Volume Clustering of Time Series Data from Zebrafish Brain Calcium Images via Mixture Modeling.","authors":"Hien D Nguyen,&nbsp;Jeremy F P Ullmann,&nbsp;Geoffrey J McLachlan,&nbsp;Venkatakaushik Voleti,&nbsp;Wenze Li,&nbsp;Elizabeth M C Hillman,&nbsp;David C Reutens,&nbsp;Andrew L Janke","doi":"10.1002/sam.11366","DOIUrl":"https://doi.org/10.1002/sam.11366","url":null,"abstract":"<p><p>Calcium is a ubiquitous messenger in neural signaling events. An increasing number of techniques are enabling visualization of neurological activity in animal models via luminescent proteins that bind to calcium ions. These techniques generate large volumes of spatially correlated time series. A model-based functional data analysis methodology via Gaussian mixtures is suggested for the clustering of data from such visualizations is proposed. The methodology is theoretically justified and a computationally efficient approach to estimation is suggested. An example analysis of a zebrafish imaging experiment is presented.</p>","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":null,"pages":null},"PeriodicalIF":1.3,"publicationDate":"2018-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1002/sam.11366","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"36069012","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 6
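The clustering approach above is a Gaussian mixture model over large collections of calcium time series. A hedged, generic sketch is shown below: an ordinary scikit-learn GaussianMixture fit to standardized per-voxel traces on synthetic data, not the specialized mixture model or the efficient estimation scheme of the paper.

```python
# Hedged sketch: cluster many short time series with a Gaussian mixture.
# Each row is one voxel's standardized trace; diagonal covariances keep the
# fit cheap. Synthetic data; illustration only, not the paper's model.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
n_series, n_timepoints = 500, 60
t = np.linspace(0, 2 * np.pi, n_timepoints)
# Three synthetic response profiles plus noise stand in for real traces.
profiles = np.array([np.sin(t), np.cos(t), np.zeros_like(t)])
labels_true = rng.integers(0, 3, size=n_series)
X = profiles[labels_true] + 0.4 * rng.normal(size=(n_series, n_timepoints))

# Standardize each trace so clustering reflects shape rather than scale.
X = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

gmm = GaussianMixture(n_components=3, covariance_type="diag", random_state=0)
labels = gmm.fit_predict(X)
print("cluster sizes:", np.bincount(labels))
```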
Random Forest Missing Data Algorithms.
IF 1.3 | Mathematics (CAS Tier 4)
Statistical Analysis and Data Mining Pub Date : 2017-12-01 Epub Date: 2017-06-13 DOI: 10.1002/sam.11348
Fei Tang, Hemant Ishwaran
{"title":"Random Forest Missing Data Algorithms.","authors":"Fei Tang,&nbsp;Hemant Ishwaran","doi":"10.1002/sam.11348","DOIUrl":"https://doi.org/10.1002/sam.11348","url":null,"abstract":"<p><p>Random forest (RF) missing data algorithms are an attractive approach for imputing missing data. They have the desirable properties of being able to handle mixed types of missing data, they are adaptive to interactions and nonlinearity, and they have the potential to scale to big data settings. Currently there are many different RF imputation algorithms, but relatively little guidance about their efficacy. Using a large, diverse collection of data sets, imputation performance of various RF algorithms was assessed under different missing data mechanisms. Algorithms included proximity imputation, on the fly imputation, and imputation utilizing multivariate unsupervised and supervised splitting-the latter class representing a generalization of a new promising imputation algorithm called missForest. Our findings reveal RF imputation to be generally robust with performance improving with increasing correlation. Performance was good under moderate to high missingness, and even (in certain cases) when data was missing not at random.</p>","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":null,"pages":null},"PeriodicalIF":1.3,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1002/sam.11348","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"35796889","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 377
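The study above compares several random-forest imputation strategies, including a generalization of missForest. As a hedged sketch of the missForest-style idea only (iteratively regressing each incomplete column on the others with a random forest, via scikit-learn's IterativeImputer rather than the randomForestSRC algorithms evaluated in the paper), see below.

```python
# Hedged sketch: missForest-style imputation by iteratively fitting a random
# forest for each incomplete column. Not the paper's proximity or on-the-fly
# imputation algorithms.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(3)
X_full = rng.normal(size=(200, 5))
X_full[:, 1] = 0.8 * X_full[:, 0] + 0.2 * rng.normal(size=200)  # correlated column

# Knock out 20% of entries completely at random.
X = X_full.copy()
mask = rng.random(X.shape) < 0.2
X[mask] = np.nan

imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=100, random_state=0),
    max_iter=10,
    random_state=0,
)
X_imputed = imputer.fit_transform(X)
print("mean absolute imputation error:",
      np.abs(X_imputed[mask] - X_full[mask]).mean())
```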
Use and Communication of Probabilistic Forecasts.
IF 1.3 | Mathematics (CAS Tier 4)
Statistical Analysis and Data Mining Pub Date : 2016-12-01 Epub Date: 2016-02-23 DOI: 10.1002/sam.11302
Adrian E Raftery
{"title":"Use and Communication of Probabilistic Forecasts.","authors":"Adrian E Raftery","doi":"10.1002/sam.11302","DOIUrl":"https://doi.org/10.1002/sam.11302","url":null,"abstract":"<p><p>Probabilistic forecasts are becoming more and more available. How should they be used and communicated? What are the obstacles to their use in practice? I review experience with five problems where probabilistic forecasting played an important role. This leads me to identify five types of potential users: Low Stakes Users, who don't need probabilistic forecasts; General Assessors, who need an overall idea of the uncertainty in the forecast; Change Assessors, who need to know if a change is out of line with expectatations; Risk Avoiders, who wish to limit the risk of an adverse outcome; and Decision Theorists, who quantify their loss function and perform the decision-theoretic calculations. This suggests that it is important to interact with users and to consider their goals. The cognitive research tells us that calibration is important for trust in probability forecasts, and that it is important to match the verbal expression with the task. The cognitive load should be minimized, reducing the probabilistic forecast to a single percentile if appropriate. Probabilities of adverse events and percentiles of the predictive distribution of quantities of interest seem often to be the best way to summarize probabilistic forecasts. Formal decision theory has an important role, but in a limited range of applications.</p>","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":null,"pages":null},"PeriodicalIF":1.3,"publicationDate":"2016-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1002/sam.11302","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"34944896","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 49
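Much of the essay above is about communication, but its concrete recommendation, summarizing a predictive distribution by a few percentiles and by probabilities of adverse events, is easy to illustrate. The snippet below is a minimal sketch on simulated forecast draws; the threshold and percentile choices are placeholders of mine, not values from the paper.

```python
# Hedged sketch: summarize a probabilistic forecast (Monte Carlo samples from
# a predictive distribution) by percentiles and by the probability of
# exceeding an adverse-event threshold. Threshold is illustrative.
import numpy as np

rng = np.random.default_rng(4)
forecast_samples = rng.gamma(shape=4.0, scale=2.5, size=10_000)  # stand-in predictive draws

p05, p50, p95 = np.percentile(forecast_samples, [5, 50, 95])
adverse_threshold = 20.0
p_adverse = np.mean(forecast_samples > adverse_threshold)

print(f"median forecast: {p50:.1f}")
print(f"90% predictive interval: [{p05:.1f}, {p95:.1f}]")
print(f"P(outcome > {adverse_threshold}): {p_adverse:.2%}")
```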
Hierarchical Models for Multiple, Rare Outcomes Using Massive Observational Healthcare Databases.
IF 1.3 | Mathematics (CAS Tier 4)
Statistical Analysis and Data Mining Pub Date : 2016-08-01 Epub Date: 2016-07-17 DOI: 10.1002/sam.11324
Trevor R Shaddox, Patrick B Ryan, Martijn J Schuemie, David Madigan, Marc A Suchard
{"title":"Hierarchical Models for Multiple, Rare Outcomes Using Massive Observational Healthcare Databases.","authors":"Trevor R Shaddox, Patrick B Ryan, Martijn J Schuemie, David Madigan, Marc A Suchard","doi":"10.1002/sam.11324","DOIUrl":"10.1002/sam.11324","url":null,"abstract":"<p><p>Clinical trials often lack power to identify rare adverse drug events (ADEs) and therefore cannot address the threat rare ADEs pose, motivating the need for new ADE detection techniques. Emerging national patient claims and electronic health record databases have inspired post-approval early detection methods like the Bayesian self-controlled case series (BSCCS) regression model. Existing BSCCS models do not account for multiple outcomes, where pathology may be shared across different ADEs. We integrate a pathology hierarchy into the BSCCS model by developing a novel informative hierarchical prior linking outcome-specific effects. Considering shared pathology drastically increases the dimensionality of the already massive models in this field. We develop an efficient method for coping with the dimensionality expansion by reducing the hierarchical model to a form amenable to existing tools. Through a synthetic study we demonstrate decreased bias in risk estimates for drugs when using conditions with different true risk and unequal prevalence. We also examine observational data from the MarketScan Lab Results dataset, exposing the bias that results from aggregating outcomes, as previously employed to estimate risk trends of warfarin and dabigatran for intracranial hemorrhage and gastrointestinal bleeding. We further investigate the limits of our approach by using extremely rare conditions. This research demonstrates that analyzing multiple outcomes simultaneously is feasible at scale and beneficial.</p>","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":null,"pages":null},"PeriodicalIF":1.3,"publicationDate":"2016-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5423675/pdf/nihms799155.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"34993872","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
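The key modeling idea above is a hierarchical prior that links a drug's outcome-specific effects so that related, rare outcomes borrow strength from one another. The sketch below shows only the generic normal partial-pooling calculation behind that idea, with made-up numbers and crude plug-in hyperparameters; it is not the BSCCS model or its dimension-reduction machinery.

```python
# Hedged sketch: partial pooling of outcome-specific effect estimates under a
# normal hierarchical prior. Hypothetical numbers; not the BSCCS model.
import numpy as np

# Per-outcome log relative-risk estimates and standard errors
# (invented values for three related adverse events).
beta_hat = np.array([0.9, 0.2, 1.4])
se = np.array([0.5, 0.3, 0.8])

# Hierarchical prior: beta_k ~ Normal(mu, tau^2), with crude plug-in mu, tau^2.
mu = np.average(beta_hat, weights=1 / se**2)
tau2 = max(np.var(beta_hat) - np.mean(se**2), 0.05)

# Posterior mean of each outcome-specific effect: precision-weighted average
# of its own estimate and the shared mean.
precision_data = 1 / se**2
precision_prior = 1 / tau2
beta_pooled = (precision_data * beta_hat + precision_prior * mu) / (
    precision_data + precision_prior
)

print("unpooled:", np.round(beta_hat, 2))
print("pooled:  ", np.round(beta_pooled, 2))
```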
Nonlinear Joint Latent Variable Models and Integrative Tumor Subtype Discovery.
IF 1.3 | Mathematics (CAS Tier 4)
Statistical Analysis and Data Mining Pub Date : 2016-04-01 Epub Date: 2016-03-28 DOI: 10.1002/sam.11306
Binghui Liu, Xiaotong Shen, Wei Pan
{"title":"Nonlinear Joint Latent Variable Models and Integrative Tumor Subtype Discovery.","authors":"Binghui Liu,&nbsp;Xiaotong Shen,&nbsp;Wei Pan","doi":"10.1002/sam.11306","DOIUrl":"https://doi.org/10.1002/sam.11306","url":null,"abstract":"<p><p>Integrative analysis has been used to identify clusters by integrating data of disparate types, such as deoxyribonucleic acid (DNA) copy number alterations and DNA methylation changes for discovering novel subtypes of tumors. Most existing integrative analysis methods are based on joint latent variable models, which are generally divided into two classes: joint factor analysis and joint mixture modeling, with continuous and discrete parameterizations of the latent variables respectively. Despite recent progresses, many issues remain. In particular, existing integration methods based on joint factor analysis may be inadequate to model multiple clusters due to the unimodality of the assumed Gaussian distribution, while those based on joint mixture modeling may not have the ability for dimension reduction and/or feature selection. In this paper, we employ a nonlinear joint latent variable model to allow for flexible modeling that can account for multiple clusters as well as conduct dimension reduction and feature selection. We propose a method, called integrative and regularized generative topographic mapping (irGTM), to perform simultaneous dimension reduction across multiple types of data while achieving feature selection separately for each data type. Simulations are performed to examine the operating characteristics of the methods, in which the proposed method compares favorably against the popular iCluster that is based on a linear joint latent variable model. Finally, a glioblastoma multiforme (GBM) dataset is examined.</p>","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":null,"pages":null},"PeriodicalIF":1.3,"publicationDate":"2016-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1002/sam.11306","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"35736330","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
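The abstract above proposes a nonlinear joint latent variable model (irGTM) that reduces dimension across data types while selecting features within each. As a loose, hedged stand-in only, the sketch below shows the overall integrative workflow with a generic nonlinear-reduce-then-cluster baseline on two stacked synthetic data blocks; it is nothing like irGTM's generative topographic mapping or its regularization, and every name and parameter here is my own.

```python
# Hedged sketch: integrative clustering of two data types by standardizing
# each block, stacking them, applying a nonlinear dimension reduction, and
# clustering in the reduced space. A generic baseline, not irGTM or iCluster.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import KernelPCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(5)
n = 150
labels_true = rng.integers(0, 3, size=n)
# Two synthetic "omics" blocks sharing the same cluster structure
# (stand-ins for, e.g., copy number and methylation features).
copy_number = labels_true[:, None] + rng.normal(scale=0.7, size=(n, 40))
methylation = np.sin(labels_true)[:, None] + rng.normal(scale=0.7, size=(n, 60))

blocks = [StandardScaler().fit_transform(b) for b in (copy_number, methylation)]
X_joint = np.hstack(blocks)

embedding = KernelPCA(n_components=3, kernel="rbf", gamma=0.01).fit_transform(X_joint)
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(embedding)
print("cluster sizes:", np.bincount(clusters))
```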
Composite large margin classifiers with latent subclasses for heterogeneous biomedical data.
IF 1.3 | Mathematics (CAS Tier 4)
Statistical Analysis and Data Mining Pub Date : 2016-04-01 Epub Date: 2016-01-08 DOI: 10.1002/sam.11300
Guanhua Chen, Yufeng Liu, Dinggang Shen, Michael R Kosorok
{"title":"Composite large margin classifiers with latent subclasses for heterogeneous biomedical data.","authors":"Guanhua Chen, Yufeng Liu, Dinggang Shen, Michael R Kosorok","doi":"10.1002/sam.11300","DOIUrl":"10.1002/sam.11300","url":null,"abstract":"<p><p>High dimensional classification problems are prevalent in a wide range of modern scientific applications. Despite a large number of candidate classification techniques available to use, practitioners often face a dilemma of choosing between linear and general nonlinear classifiers. Specifically, simple linear classifiers have good interpretability, but may have limitations in handling data with complex structures. In contrast, general nonlinear classifiers are more flexible, but may lose interpretability and have higher tendency for overfitting. In this paper, we consider data with potential latent subgroups in the classes of interest. We propose a new method, namely the Composite Large Margin Classifier (CLM), to address the issue of classification with latent subclasses. The CLM aims to find three linear functions simultaneously: one linear function to split the data into two parts, with each part being classified by a different linear classifier. Our method has comparable prediction accuracy to a general nonlinear classifier, and it maintains the interpretability of traditional linear classifiers. We demonstrate the competitive performance of the CLM through comparisons with several existing linear and nonlinear classifiers by Monte Carlo experiments. Analysis of the Alzheimer's disease classification problem using CLM not only provides a lower classification error in discriminating cases and controls, but also identifies subclasses in controls that are more likely to develop the disease in the future.</p>","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":null,"pages":null},"PeriodicalIF":1.3,"publicationDate":"2016-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4912001/pdf/nihms737408.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"34597836","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
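The CLM described above fits three linear functions at once: a gate that splits the sample into two latent subgroups and one linear classifier per subgroup. The sketch below is a rough EM-style stand-in of mine (alternating between assigning points to the sub-classifier that explains them better, refitting the two logistic regressions, and then fitting a logistic-regression gate); the actual CLM objective and its large-margin optimization in the paper differ.

```python
# Hedged sketch: a composite of two linear (logistic) classifiers with a
# linear gate, fit by crude alternating assignment. Illustrative only; the
# paper's CLM uses a different large-margin objective and optimization.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

def fit_composite(X, y, n_iter=10, random_state=0):
    # Initialize latent subgroup membership by clustering the covariates,
    # and both experts on the full data as a safe starting point.
    group = KMeans(n_clusters=2, n_init=10, random_state=random_state).fit_predict(X)
    experts = [LogisticRegression(max_iter=1000).fit(X, y) for _ in (0, 1)]
    for _ in range(n_iter):
        for g in (0, 1):
            idx = group == g
            if idx.sum() > 10 and len(np.unique(y[idx])) == 2:
                experts[g] = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
        # Reassign each point to the expert giving its true label higher probability.
        prob = np.column_stack(
            [e.predict_proba(X)[np.arange(len(y)), y] for e in experts]
        )
        group = prob.argmax(axis=1)
    gate = LogisticRegression(max_iter=1000).fit(X, group)
    return gate, experts

def predict_composite(gate, experts, X):
    route = gate.predict(X)
    out = np.empty(len(X), dtype=int)
    for g in (0, 1):
        if np.any(route == g):
            out[route == g] = experts[g].predict(X[route == g])
    return out

# Toy heterogeneous data: the decision rule differs between two latent subgroups.
rng = np.random.default_rng(6)
X = rng.normal(size=(600, 2))
subgroup = (X[:, 0] > 0).astype(int)
y = np.where(subgroup == 1, X[:, 1] > 0.5, X[:, 1] < -0.5).astype(int)
gate, experts = fit_composite(X, y)
print("training accuracy:", (predict_composite(gate, experts, X) == y).mean())
```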
Feature Import Vector Machine: A General Classifier with Flexible Feature Selection.
IF 1.3 | Mathematics (CAS Tier 4)
Statistical Analysis and Data Mining Pub Date : 2015-02-01 Epub Date: 2015-01-26 DOI: 10.1002/sam.11259
Samiran Ghosh, Yazhen Wang
{"title":"Feature Import Vector Machine: A General Classifier with Flexible Feature Selection.","authors":"Samiran Ghosh,&nbsp;Yazhen Wang","doi":"10.1002/sam.11259","DOIUrl":"https://doi.org/10.1002/sam.11259","url":null,"abstract":"<p><p>The support vector machine (SVM) and other reproducing kernel Hilbert space (RKHS) based classifier systems are drawing much attention recently due to its robustness and generalization capability. General theme here is to construct classifiers based on the training data in a high dimensional space by using all available dimensions. The SVM achieves huge data compression by selecting only few observations which lie close to the boundary of the classifier function. However when the number of observations are not very large (small <i>n</i>) but the number of dimensions/features are large (large <i>p</i>), then it is not necessary that all available features are of equal importance in the classification context. Possible selection of an useful fraction of the available features may result in huge data compression. In this paper we propose an algorithmic approach by means of which such an <i>optimal</i> set of features could be selected. In short, we reverse the traditional sequential observation selection strategy of SVM to that of sequential feature selection. To achieve this we have modified the solution proposed by Zhu and Hastie (2005) in the context of import vector machine (IVM), to select an <i>optimal</i> sub-dimensional model to build the final classifier with sufficient accuracy.</p>","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":null,"pages":null},"PeriodicalIF":1.3,"publicationDate":"2015-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1002/sam.11259","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"34463560","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
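The core idea above is to flip the IVM's sequential selection of observations into sequential selection of features for a kernel classifier. A hedged, generic version of that idea is sketched below: greedy forward feature selection wrapped around an RBF-kernel SVM via scikit-learn, not the Zhu-Hastie-based algorithm developed in the paper.

```python
# Hedged sketch: greedy forward selection of features for a kernel classifier.
# Generic wrapper selection, not the feature import vector machine itself.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Small-n, larger-p toy problem with only a few informative features.
X, y = make_classification(
    n_samples=120, n_features=30, n_informative=4, n_redundant=0, random_state=0
)

clf = SVC(kernel="rbf", C=1.0, gamma="scale")
selector = SequentialFeatureSelector(
    clf, n_features_to_select=4, direction="forward", cv=5
)
selector.fit(X, y)

selected = selector.get_support(indices=True)
print("selected feature indices:", selected)
print(
    "CV accuracy on selected features:",
    cross_val_score(clf, X[:, selected], y, cv=5).mean().round(3),
)
```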
Survival Analysis with Electronic Health Record Data: Experiments with Chronic Kidney Disease.
IF 2.1 | Mathematics (CAS Tier 4)
Statistical Analysis and Data Mining Pub Date : 2014-10-01 Epub Date: 2014-08-19 DOI: 10.1002/sam.11236
Yolanda Hagar, David Albers, Rimma Pivovarov, Herbert Chase, Vanja Dukic, Noémie Elhadad
{"title":"Survival Analysis with Electronic Health Record Data: Experiments with Chronic Kidney Disease.","authors":"Yolanda Hagar, David Albers, Rimma Pivovarov, Herbert Chase, Vanja Dukic, Noémie Elhadad","doi":"10.1002/sam.11236","DOIUrl":"10.1002/sam.11236","url":null,"abstract":"<p><p>This paper presents a detailed survival analysis for chronic kidney disease (CKD). The analysis is based on the EHR data comprising almost two decades of clinical observations collected at New York-Presbyterian, a large hospital in New York City with one of the oldest electronic health records in the United States. Our survival analysis approach centers around Bayesian multiresolution hazard modeling, with an objective to capture the changing hazard of CKD over time, adjusted for patient clinical covariates and kidney-related laboratory tests. Special attention is paid to statistical issues common to all EHR data, such as cohort definition, missing data and censoring, variable selection, and potential for joint survival and longitudinal modeling, all of which are discussed alone and within the EHR CKD context.</p>","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":null,"pages":null},"PeriodicalIF":2.1,"publicationDate":"2014-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8112603/pdf/nihms-1697574.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"38975743","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
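The survival machinery in the paper above is a Bayesian multiresolution hazard model with covariate adjustment. As a much simpler hedged stand-in, the snippet below estimates a piecewise-constant hazard from right-censored follow-up times by counting events and person-time within time bins; the data are simulated, there are no covariates, and there is no Bayesian smoothing across resolutions.

```python
# Hedged sketch: piecewise-constant hazard estimation from right-censored
# follow-up times (events / person-time per interval). Simulated data; not
# the Bayesian multiresolution hazard model used in the paper.
import numpy as np

rng = np.random.default_rng(7)
n = 1_000
event_times = rng.exponential(scale=5.0, size=n)   # latent time to progression
censor_times = rng.uniform(0.0, 10.0, size=n)      # administrative censoring
time = np.minimum(event_times, censor_times)
event = (event_times <= censor_times).astype(int)

bins = np.arange(0.0, 10.5, 2.0)                   # 2-year intervals
hazard = []
for lo, hi in zip(bins[:-1], bins[1:]):
    at_risk = time > lo
    # Person-time contributed inside [lo, hi) by subjects still at risk at lo.
    exposure = np.clip(np.minimum(time, hi) - lo, 0.0, None)[at_risk].sum()
    events = np.sum((event == 1) & (time > lo) & (time <= hi))
    hazard.append(events / exposure)

for i, h in enumerate(hazard):
    print(f"interval [{bins[i]:.0f}, {bins[i + 1]:.0f}): hazard = {h:.3f}")
```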