Statistical Analysis and Data Mining: Latest Articles

Driving mode analysis—How uncertain functional inputs propagate to an output
CAS Q4 (Mathematics)
Statistical Analysis and Data Mining · Pub Date: 2023-10-06 · DOI: 10.1002/sam.11646
Scott A. Vander Wiel, Michael J. Grosskopf, Isaac J. Michaud, Denise Neudecker

Abstract: Driving mode analysis elucidates how correlated features of uncertain functional inputs jointly propagate to produce uncertainty in the output of a computation. Uncertain input functions are decomposed into three terms: the mean function, a zero-mean driving mode, and a zero-mean residual. The random driving mode varies along a single direction, having fixed functional shape and random scale. It is uncorrelated with the residual, and under linear error propagation, it produces an output variance equal to that of the full input uncertainty. Finally, the driving mode best represents how input uncertainties propagate to the output because it minimizes expected squared Mahalanobis distance amongst competitors. These characteristics recommend interpreting the driving mode as the single-degree-of-freedom component of input uncertainty that drives output uncertainty. We derive the functional driving mode, show its superiority to other seemingly sensible definitions, and demonstrate the utility of driving mode analysis in an application: the simulation of neutron transport in criticality experiments. The uncertain input functions are nuclear data that describe how Pu reacts to bombardment by neutrons. Visualization of the driving mode helps scientists understand which aspects of correlated functional uncertainty reinforce or cancel one another in propagating to the output of the simulation.

Citations: 0
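A minimal numerical sketch of the driving-mode construction, under linear error propagation. All quantities are toy stand-ins (a random covariance `Sigma` for the discretized input function and a sensitivity vector `g` for the output), not the paper's nuclear-data application; the shape `d = Sigma g / sqrt(g' Sigma g)` is one natural construction consistent with the properties stated in the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 50

# Toy stand-ins: Sigma = covariance of a discretized uncertain input
# function, g = linearized sensitivity of the output to that input.
A = rng.normal(size=(p, p))
Sigma = A @ A.T / p + np.eye(p)
g = rng.normal(size=p)

total_var = g @ Sigma @ g            # output variance under linear propagation
d = Sigma @ g / np.sqrt(total_var)   # fixed functional shape of the driving mode

# Property 1: the rank-one driving-mode component alone reproduces the
# full linearly propagated output variance.
mode_var = (g @ d) ** 2

# Property 2: the residual covariance Sigma - d d' contributes nothing
# to the output variance.
resid_var = g @ (Sigma - np.outer(d, d)) @ g

print(mode_var, total_var, resid_var)
```

The check confirms that a single degree of freedom (random scale times fixed shape `d`) carries the entire output variance, with the residual orthogonal in the propagated sense.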
Residuals and diagnostics for multinomial regression models
CAS Q4 (Mathematics)
Statistical Analysis and Data Mining · Pub Date: 2023-09-29 · DOI: 10.1002/sam.11645
Eric A. E. Gerber, Bruce A. Craig

Abstract: In this paper, we extend the concept of a randomized quantile residual to multinomial regression models. Customary diagnostics for these models are limited because they involve difficult-to-interpret residuals and often focus on the fit of one category versus the rest. Our residuals account for associations between categories by using the squared Mahalanobis distances of the observed log-odds relative to their fitted sampling distributions. Aside from sampling variation, these residuals are exactly normal when the data come from the fitted model. This motivates our use of the residuals to detect model misspecification and overdispersion, in addition to an overall goodness-of-fit Kolmogorov–Smirnov test. We illustrate the use of the residuals and diagnostics in both simulation and real data studies.

Citations: 0
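A sketch of the residual construction described above, using a Gaussian stand-in for the fitted sampling distribution of the log-odds (the paper derives the actual distribution from the multinomial fit). The squared Mahalanobis distance is pushed through the chi-square CDF and the standard-normal quantile, yielding residuals that are standard normal when the model holds.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, k = 2000, 3                 # n observations, k categories -> k-1 free log-odds

# Assumed fitted sampling distribution of the observed log-odds
# (a Gaussian stand-in for what a fitted multinomial model would imply).
mu = np.array([0.5, -0.2])
S = np.array([[1.0, 0.3],
              [0.3, 0.5]])

z = rng.multivariate_normal(mu, S, size=n)   # "observed" log-odds under the model

# Squared Mahalanobis distance of each observation from its fitted distribution.
Sinv = np.linalg.inv(S)
d2 = np.einsum("ij,jk,ik->i", z - mu, Sinv, z - mu)

# Probability-integral transform: chi^2(k-1) CDF, then standard-normal
# quantile. Under the fitted model the residuals are standard normal.
r = stats.norm.ppf(stats.chi2.cdf(d2, df=k - 1))
print(r.mean(), r.std())
```

A Kolmogorov–Smirnov test of `r` against N(0, 1), as in the paper, then gives an overall goodness-of-fit check.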
Stratified learning: A general-purpose statistical method for improved learning under covariate shift
CAS Q4 (Mathematics)
Statistical Analysis and Data Mining · Pub Date: 2023-09-29 · DOI: 10.1002/sam.11643
Maximilian Autenrieth, David A. Van Dyk, Roberto Trotta, David C. Stenning

Abstract: We propose a simple, statistically principled, and theoretically justified method to improve supervised learning when the training set is not representative, a situation known as covariate shift. We build upon a well-established methodology in causal inference and show that the effects of covariate shift can be reduced or eliminated by conditioning on propensity scores. In practice, this is achieved by fitting learners within strata constructed by partitioning the data based on the estimated propensity scores, leading to approximately balanced covariates and much-improved target prediction. We refer to the overall method as Stratified Learning, or StratLearn. We demonstrate the effectiveness of this general-purpose method on two contemporary research questions in cosmology, outperforming state-of-the-art importance weighting methods. We obtain the best-reported AUC (0.958) on the updated "Supernovae photometric classification challenge," and we improve upon existing conditional density estimation of galaxy redshift from Sloan Digital Sky Survey (SDSS) data.

Citations: 1
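A toy sketch of the StratLearn recipe: stratify all units by propensity-score quintiles and fit a separate learner within each stratum. Everything here is a simplified assumption — a one-dimensional regression problem, the propensity score computed from the known toy densities rather than estimated with a classifier, and a quadratic polynomial as the within-stratum learner.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 3000
f = lambda x: np.sin(4 * x)              # true regression function (toy)

# Covariate shift: training covariates skew low, target covariates skew high.
x_tr = rng.beta(2, 3, size=n)
x_te = rng.beta(3, 2, size=n)
y_tr = f(x_tr) + rng.normal(0, 0.1, size=n)

# Propensity score e(x) = P(target | x), here from the known toy densities;
# in practice StratLearn estimates it from the data.
def e(x):
    f_te, f_tr = stats.beta.pdf(x, 3, 2), stats.beta.pdf(x, 2, 3)
    return f_te / (f_te + f_tr)

# Partition by propensity-score quintiles; fit a learner per stratum.
edges = np.quantile(e(np.concatenate([x_tr, x_te])), [0, .2, .4, .6, .8, 1.0])
pred = np.polyval(np.polyfit(x_tr, y_tr, 2), x_te)   # global fallback fit
for lo, hi in zip(edges[:-1], edges[1:]):
    in_tr = (e(x_tr) >= lo) & (e(x_tr) <= hi)
    in_te = (e(x_te) >= lo) & (e(x_te) <= hi)
    if in_tr.sum() >= 30:
        c = np.polyfit(x_tr[in_tr], y_tr[in_tr], 2)
        pred[in_te] = np.polyval(c, x_te[in_te])

mse = np.mean((pred - f(x_te)) ** 2)
print(mse)
```

Within each stratum the training and target covariates are approximately balanced, so a simple local learner predicts well on the shifted target sample.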
On difference-based gradient estimation in nonparametric regression
CAS Q4 (Mathematics)
Statistical Analysis and Data Mining · Pub Date: 2023-09-16 · DOI: 10.1002/sam.11644
Maoyu Zhang, Wenlin Dai

Abstract: We propose a framework to directly estimate the gradient in multivariate nonparametric regression models that bypasses fitting the regression function. Specifically, we construct the estimator as a linear combination of adjacent observations with the coefficients from a vector-valued difference sequence, so it is more flexible than existing methods. Under equidistant designs, closed-form solutions of the optimal sequences are derived by minimizing the estimation variance, with the estimation bias well controlled. We derive the theoretical properties of the estimators and show that they achieve the optimal convergence rate. Further, we propose a data-driven tuning parameter-selection criterion for practical implementation. The effectiveness of our estimators is validated via simulation studies and a real data application.

Citations: 0
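A one-dimensional sketch of the difference-based idea: estimate the gradient directly as a weighted combination of adjacent observations, without fitting the regression function. The weights below (proportional to the offset j, normalized so the estimator is exact for linear trends) are the classical variance-minimizing symmetric sequence on an equidistant design — an illustration of the flavor of result the paper generalizes, not the authors' vector-valued sequences.

```python
import numpy as np

rng = np.random.default_rng(3)
n, sigma = 1000, 0.05
h = 1.0 / n
x = np.arange(n) * h                       # equidistant design
y = np.sin(2 * np.pi * x) + rng.normal(0, sigma, n)

def diff_gradient(y, h, k):
    # Symmetric difference-based estimate at interior points i = k..n-k-1:
    #   g_i = sum_{j=1..k} w_j * (y[i+j] - y[i-j]).
    # Weights w_j proportional to j minimize the variance subject to the
    # unbiasedness constraint sum_j 2*j*h*w_j = 1 (exact for linear trends).
    j = np.arange(1, k + 1)
    w = j / (2 * h * np.sum(j ** 2))
    m = len(y)
    g = np.zeros(m - 2 * k)
    for jj, wj in zip(j, w):
        g += wj * (y[k + jj : m - k + jj] - y[k - jj : m - k - jj])
    return g

true = 2 * np.pi * np.cos(2 * np.pi * x)
mse1 = np.mean((diff_gradient(y, h, 1) - true[1:-1]) ** 2)    # central difference
mse10 = np.mean((diff_gradient(y, h, 10) - true[10:-10]) ** 2)
print(mse1, mse10)
```

With k = 1 the estimator reduces to the noisy central difference; averaging over a longer difference sequence (k = 10) slashes the variance while the bias stays negligible for a smooth trend.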
Integrative Learning of Structured High-Dimensional Data from Multiple Datasets
CAS Q4 (Mathematics) · IF 1.3
Statistical Analysis and Data Mining · Pub Date: 2023-04-01 · Epub: 2022-11-08 · DOI: 10.1002/sam.11601
Changgee Chang, Zongyu Dai, Jihwan Oh, Qi Long

Abstract: Integrative learning of multiple datasets has the potential to mitigate the challenge of small n and large p that is often encountered in analysis of big biomedical data such as genomics data. Detection of weak yet important signals can be enhanced by jointly selecting features for all datasets. However, the set of important features may not always be the same across all datasets. Although some existing integrative learning methods allow heterogeneous sparsity structure, where a subset of datasets can have zero coefficients for some selected features, they tend to yield reduced efficiency, reinstating the problem of losing weak important signals. We propose a new integrative learning approach which can not only aggregate important signals well in homogeneous sparsity structure, but also substantially alleviate the problem of losing weak important signals in heterogeneous sparsity structure. Our approach exploits a priori known graphical structure of features and encourages joint selection of features that are connected in the graph. Integrating such prior information over multiple datasets enhances the power, while also accounting for the heterogeneity across datasets. Theoretical properties of the proposed method are investigated. We also demonstrate the limitations of existing approaches and the superiority of our method using a simulation study and analysis of gene expression data from ADNI.

Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10195070/pdf/
Citations: 0
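A deliberately crude caricature of the two ingredients named above — joint aggregation of signals across datasets and reinforcement of graph-connected features. This is not the authors' method (which is a penalized/structured estimation procedure); it only illustrates, with marginal correlation scores and one step of neighbor smoothing on an assumed ring-shaped feature graph, why combining datasets and a known graph helps weak signals stand out.

```python
import numpy as np

rng = np.random.default_rng(4)
p, n, n_sets = 100, 50, 3          # small n, large p, several datasets
support = np.arange(10)            # true features, adjacent in the graph
beta = 0.3                         # weak per-feature signal

# Assumed known feature graph: a ring, so the true features form a path.
A = np.zeros((p, p))
idx = np.arange(p)
A[idx, (idx + 1) % p] = A[(idx + 1) % p, idx] = 1

# Per-dataset marginal correlation scores, aggregated across datasets.
score = np.zeros(p)
for _ in range(n_sets):
    X = rng.normal(size=(n, p))
    y = X[:, support] @ np.full(10, beta) + rng.normal(size=n)
    cov = (X - X.mean(0)).T @ (y - y.mean()) / n
    score += np.abs(cov / (X.std(0) * y.std())) / n_sets

# Graph smoothing: scores of connected features reinforce one another.
deg = A.sum(1)
smoothed = 0.5 * score + 0.5 * (A @ score) / deg

gap = smoothed[support].mean() - np.delete(smoothed, support).mean()
print(gap)
```

After aggregation and smoothing, the weak true features separate cleanly from the null features, which is the qualitative effect the paper's structured prior is designed to achieve (with proper uncertainty handling).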
A Tutorial on Generative Adversarial Networks with Application to Classification of Imbalanced Data
CAS Q4 (Mathematics) · IF 1.3
Statistical Analysis and Data Mining · Pub Date: 2022-10-01 · Epub: 2021-12-31 · DOI: 10.1002/sam.11570
Yuxiao Huang, Kara G Fields, Yan Ma

Abstract: A challenge unique to classification model development is imbalanced data. In a binary classification problem, class imbalance occurs when one class, the minority group, contains significantly fewer samples than the other class, the majority group. In imbalanced data, the minority class is often the class of interest (e.g., patients with disease). However, when training a classifier on imbalanced data, the model will exhibit bias towards the majority class and, in extreme cases, may ignore the minority class completely. A common strategy for addressing class imbalance is data augmentation. However, traditional data augmentation methods are associated with overfitting, where the model is fit to the noise in the data. In this tutorial we introduce an advanced method for data augmentation: Generative Adversarial Networks (GANs). The advantages of GANs over traditional data augmentation methods are illustrated using the Breast Cancer Wisconsin study. To promote the adoption of GANs for data augmentation, we present an end-to-end pipeline that encompasses the complete life cycle of a machine learning project, along with alternatives and good practices, both in the paper and in a separate video. Our code, data, full results, and video tutorial are publicly available in the paper's GitHub repository.

Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9529000/pdf/nihms-1766432.pdf
Citations: 3
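A small numeric illustration of the problem the tutorial addresses and of the traditional augmentation baseline it compares GANs against. A Gaussian plug-in classifier trained on 95:5 imbalanced data all but ignores the minority class; random oversampling of the minority (the classical baseline, prone to the overfitting the abstract mentions) restores minority recall. The GAN itself is out of scope for a short sketch; all data here are synthetic toys.

```python
import numpy as np

rng = np.random.default_rng(5)

# Imbalanced binary data: majority class N(0,1), minority class N(2,1).
n_maj, n_min = 1900, 100
x = np.concatenate([rng.normal(0, 1, n_maj), rng.normal(2, 1, n_min)])
y = np.concatenate([np.zeros(n_maj), np.ones(n_min)])

def plugin_recall(x, y, x_min_test):
    # Gaussian plug-in classifier: estimated class means/stds and priors,
    # classify by the larger (log) posterior.
    m0, m1 = x[y == 0].mean(), x[y == 1].mean()
    s0, s1 = x[y == 0].std(), x[y == 1].std()
    p1 = y.mean()
    def logpost_diff(t):
        return (np.log(p1) - np.log(s1) - 0.5 * ((t - m1) / s1) ** 2) \
             - (np.log(1 - p1) - np.log(s0) - 0.5 * ((t - m0) / s0) ** 2)
    return np.mean(logpost_diff(x_min_test) > 0)   # fraction of minority caught

x_min_test = rng.normal(2, 1, 1000)                # fresh minority samples
recall_imbalanced = plugin_recall(x, y, x_min_test)

# Traditional augmentation baseline: random oversampling of the minority.
reps = n_maj // n_min
x_bal = np.concatenate([x[y == 0], np.tile(x[y == 1], reps)])
y_bal = np.concatenate([np.zeros(n_maj), np.ones(n_min * reps)])
recall_balanced = plugin_recall(x_bal, y_bal, x_min_test)

print(recall_imbalanced, recall_balanced)
```

The imbalanced fit shifts the decision boundary deep into minority territory; balancing the priors moves it back. A GAN replaces the literal copies with new synthetic minority samples, which is the tutorial's remedy for the oversampling overfitting problem.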
Regression-Based Bayesian Estimation and Structure Learning for Nonparanormal Graphical Models
CAS Q4 (Mathematics) · IF 1.3
Statistical Analysis and Data Mining · Pub Date: 2022-10-01 · Epub: 2022-02-28 · DOI: 10.1002/sam.11576
Jami J Mulgrave, Subhashis Ghosal

Abstract: A nonparanormal graphical model is a semiparametric generalization of a Gaussian graphical model for continuous variables, in which it is assumed that the variables follow a Gaussian graphical model only after some unknown smooth monotone transformations. We consider a Bayesian approach to inference in a nonparanormal graphical model in which we put priors on the unknown transformations through a random series based on B-splines. We use a regression formulation to construct the likelihood through the Cholesky decomposition of the underlying precision matrix of the transformed variables and put shrinkage priors on the regression coefficients. We apply a plug-in variational Bayesian algorithm for learning the sparse precision matrix and compare the performance to a posterior Gibbs sampling scheme in a simulation study. We finally apply the proposed methods to a microarray data set. The proposed methods have better performance as the dimension increases; in particular, the variational Bayesian approach has the potential to speed up estimation in the Bayesian nonparanormal graphical model, without the Gaussianity assumption, while retaining the information needed to construct the graph.

Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9455150/pdf/
Citations: 0
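A sketch of the nonparanormal setting itself, using the simple frequentist normal-scores transform rather than the paper's Bayesian B-spline priors: a latent Gaussian dependence is hidden by unknown monotone transformations, and a rank-based Gaussianization recovers the latent correlation that raw Pearson correlation attenuates.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n, rho = 2000, 0.6

# Latent Gaussian pair with correlation rho, pushed through unknown
# smooth monotone transformations -- the nonparanormal setting.
z = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=n)
x = np.column_stack([np.exp(z[:, 0]), z[:, 1] ** 3])

def normal_scores(v):
    # Empirical-CDF rank transform followed by the standard-normal
    # quantile; a simple stand-in for the paper's B-spline transformations.
    ranks = np.argsort(np.argsort(v))
    return stats.norm.ppf((ranks + 1) / (len(v) + 1))

corr_raw = np.corrcoef(x[:, 0], x[:, 1])[0, 1]
corr_scores = np.corrcoef(normal_scores(x[:, 0]), normal_scores(x[:, 1]))[0, 1]
print(corr_raw, corr_scores)
```

On the Gaussianized scale, standard Gaussian graphical-model machinery (e.g., a Cholesky-based regression formulation of the precision matrix, as in the paper) applies again.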
Evaluation and interpretation of driving risks: Automobile claim frequency modeling with telematics data
CAS Q4 (Mathematics) · IF 1.3
Statistical Analysis and Data Mining · Pub Date: 2022-09-28 · DOI: 10.1002/sam.11599
Yaqian Gao, Yifan Huang, Shengwang Meng

Abstract: With the development of vehicle telematics and data mining technology, usage-based insurance (UBI) has aroused widespread interest from both academia and industry. The extensive driving behavior features make it possible to further understand the risks of insured vehicles, but pose challenges in the identification and interpretation of important ratemaking factors. This study, based on the telematics data of policyholders in mainland China, analyzes the insurance claim frequency of commercial trucks using both Poisson regression and several machine learning models, including regression trees, random forests, gradient boosting trees, XGBoost, and neural networks. After selecting the best model, we analyze feature importance, feature effects, and the contribution of each feature to the prediction from an actuarial perspective. Our empirical study shows that XGBoost greatly outperforms the traditional models and detects some important risk factors, such as average speed, average mileage traveled per day, the fraction of night driving, the number of sudden brakes, and the fraction of left/right turns at intersections. These features usually have a nonlinear effect on driving risk, and there are complex interactions between features. To further distinguish high-/low-risk drivers, we run supervised clustering for risk segmentation according to drivers' driving habits. In summary, this study not only provides a more accurate prediction of driving risk, but also meets the interpretability requirements of insurance regulators and risk managers.

Citations: 0
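A sketch of the Poisson-regression baseline the study starts from: claim counts with a log link, fit by iteratively reweighted least squares (IRLS). The data and feature names are illustrative stand-ins, not the study's telematics variables.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 5000

# Toy standardized telematics-style features; names are illustrative only.
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([-0.5, 0.4, -0.3])   # intercept, "avg speed", "night fraction"
y = rng.poisson(np.exp(X @ beta_true))    # claim counts, log-linear Poisson model

# Poisson regression with log link, fit by IRLS (Newton scoring):
#   beta <- beta + (X' W X)^{-1} X'(y - mu),  with W = diag(mu).
beta = np.zeros(3)
for _ in range(25):
    mu = np.exp(X @ beta)
    XtWX = X.T @ (X * mu[:, None])
    beta = beta + np.linalg.solve(XtWX, X.T @ (y - mu))

print(beta)
```

The study's point is that boosted trees such as XGBoost can go beyond this baseline by capturing the nonlinear effects and interactions among the telematics features, at the cost of requiring post-hoc interpretation tools.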
A General Iterative Clustering Algorithm
CAS Q4 (Mathematics) · IF 1.3
Statistical Analysis and Data Mining · Pub Date: 2022-08-01 · Epub: 2022-01-31 · DOI: 10.1002/sam.11573
Ziqiang Lin, Eugene Laska, Carole Siegel

Abstract: The quality of a cluster analysis of unlabeled units depends on the quality of the between-unit dissimilarity measures. Data-dependent dissimilarity is more objective than data-independent geometric measures such as Euclidean distance. As suggested by Breiman, many data-driven approaches are based on decision tree ensembles, such as a random forest (RF), that produce a proximity matrix that can easily be transformed into a dissimilarity matrix. An RF can be obtained using labels that distinguish units with real data from units with synthetic data. The resulting dissimilarity matrix is input to a clustering program, and units are assigned labels corresponding to cluster membership. We introduce a General Iterative Cluster (GIC) algorithm that improves the proximity matrix and clusters of the base RF. The cluster labels are used to grow a new RF, yielding an updated proximity matrix which is entered into the clustering program. The process is repeated until convergence. The same procedure can be used with many base procedures, such as the Extremely Randomized Tree ensemble. We evaluate the performance of the GIC algorithm using benchmark and simulated data sets. The properties measured by the Silhouette Score are substantially superior to those of the base clustering algorithm. The GIC package has been released on CRAN: https://cran.r-project.org/web/packages/GIC/index.html.

Open-access PDF: https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_pdf/c8/08/SAM-15-433.PMC9438941.pdf
Citations: 1
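A sketch of the base step GIC iterates on: build a data-driven proximity matrix from a randomized ensemble, convert it to a dissimilarity matrix, and cluster. To stay self-contained, the "ensemble" here is random single-feature splits (stumps) rather than a full random forest, so this shows only the proximity-to-clusters pipeline, not GIC's supervised relabel-and-regrow iteration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(8)

# Two well-separated groups of unlabeled units (toy data).
X = np.vstack([rng.normal(0, 0.8, size=(100, 2)),
               rng.normal(6, 0.8, size=(100, 2))])
n = len(X)

# Proximity from an ensemble of random splits: units falling on the same
# side of a random feature/threshold split accumulate proximity, in the
# spirit of random-forest proximities.
B = 300
prox = np.zeros((n, n))
for _ in range(B):
    f = rng.integers(2)
    t = rng.uniform(X[:, f].min(), X[:, f].max())
    side = X[:, f] > t
    prox += side[:, None] == side[None, :]
prox /= B

# Dissimilarity -> hierarchical clustering. GIC would now use these cluster
# labels to grow a new supervised forest and iterate to convergence.
D = 1.0 - prox
np.fill_diagonal(D, 0.0)
labels = fcluster(linkage(squareform(D, checks=False), method="average"),
                  2, criterion="maxclust")

truth = np.repeat([1, 2], 100)
agreement = max(np.mean(labels == truth), np.mean(labels == 3 - truth))
print(agreement)
```

Within-group pairs fall on the same side of most random splits, so the ensemble proximity separates the groups without ever using a geometric distance directly.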
Multi-scale affinities with missing data: Estimation and applications
CAS Q4 (Mathematics) · IF 2.1
Statistical Analysis and Data Mining · Pub Date: 2022-06-01 · Epub: 2021-11-05 · DOI: 10.1002/sam.11561
Min Zhang, Gal Mishne, Eric C Chi

Abstract: Many machine learning algorithms depend on weights that quantify row and column similarities of a data matrix. The choice of weights can dramatically impact the effectiveness of the algorithm. Nonetheless, the problem of choosing weights has arguably not been given enough study. When a data matrix is completely observed, Gaussian kernel affinities can be used to quantify the local similarity between pairs of rows and pairs of columns. Computing weights in the presence of missing data, however, becomes challenging. In this paper, we propose a new method to construct row and column affinities even when data are missing by building off a co-clustering technique. This method takes advantage of solving the optimization problem for multiple pairs of cost parameters and filling in the missing values with increasingly smooth estimates. It exploits the coupled similarity structure among both the rows and columns of a data matrix. We show these affinities can be used to perform tasks such as data imputation, clustering, and matrix completion on graphs.

Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9216212/pdf/nihms-1751214.pdf
Citations: 0
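A toy sketch of the coupled idea in the abstract — alternating between Gaussian-kernel row/column affinities and increasingly smooth fills of the missing entries. This is a simplification under stated assumptions (a smooth rank-one matrix, a fixed kernel bandwidth heuristic), not the paper's co-clustering optimization over multiple cost-parameter pairs.

```python
import numpy as np

rng = np.random.default_rng(9)
n_r, n_c = 60, 40

# Smooth low-rank matrix with entries missing completely at random.
t = np.linspace(0, np.pi, n_r)
s = np.linspace(0, np.pi, n_c)
X_true = np.outer(np.sin(t), np.cos(s))
mask = rng.random((n_r, n_c)) < 0.2          # True = missing

# Baseline fill: column means of the observed entries.
X_obs = np.where(mask, np.nan, X_true)
col_means = np.nanmean(X_obs, axis=0)
X_hat = np.where(mask, col_means[None, :], X_true)
rmse_mean = np.sqrt(np.mean((X_hat - X_true)[mask] ** 2))

def affinities(M):
    # Gaussian kernel affinities between rows of M, with a small-quantile
    # bandwidth heuristic, row-normalized into a smoothing operator.
    D2 = ((M[:, None, :] - M[None, :, :]) ** 2).sum(-1)
    eps = np.quantile(D2[D2 > 0], 0.05)
    W = np.exp(-D2 / eps)
    return W / W.sum(1, keepdims=True)

# Alternate: recompute row/column affinities from the current fill, smooth,
# and reimpose the observed entries -- increasingly smooth estimates.
for _ in range(10):
    Wr = affinities(X_hat)        # row affinities
    Wc = affinities(X_hat.T)      # column affinities
    X_smooth = 0.5 * (Wr @ X_hat) + 0.5 * (X_hat @ Wc.T)
    X_hat = np.where(mask, X_smooth, X_true)

rmse_iter = np.sqrt(np.mean((X_hat - X_true)[mask] ** 2))
print(rmse_mean, rmse_iter)
```

Because row and column similarity structures are coupled, each pass sharpens the affinities, which in turn sharpen the imputations of the missing entries.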