{"title":"The spike-and-slab lasso and scalable algorithm to accommodate multinomial outcomes in variable selection problems","authors":"Justin M. Leach, Nengjun Yi, Inmaculada Aban, The Alzheimer's Disease Neuroimaging Initiative","doi":"10.1080/02664763.2023.2258301","DOIUrl":"https://doi.org/10.1080/02664763.2023.2258301","url":null,"abstract":"Spike-and-slab prior distributions are used to impose variable selection in Bayesian regression-style problems with many possible predictors. These priors are a mixture of two zero-centered distributions with differing variances, resulting in different shrinkage levels on parameter estimates based on whether they are relevant to the outcome. The spike-and-slab lasso assigns mixtures of double exponential distributions as priors for the parameters. This framework was initially developed for linear models, later developed for generalized linear models, and shown to perform well in scenarios requiring sparse solutions. Standard formulations of generalized linear models cannot immediately accommodate categorical outcomes with > 2 categories, i.e. multinomial outcomes, and require modifications to model specification and parameter estimation. Such modifications are relatively straightforward in a Classical setting but require additional theoretical and computational considerations in Bayesian settings, which can depend on the choice of prior distributions for the parameters of interest. While previous developments of the spike-and-slab lasso focused on continuous, count, and/or binary outcomes, we generalize the spike-and-slab lasso to accommodate multinomial outcomes, developing both the theoretical basis for the model and an expectation-maximization algorithm to fit the model. 
To our knowledge, this is the first generalization of the spike-and-slab lasso to allow for multinomial outcomes. Keywords: Bayesian variable selection; spike-and-slab; generalized linear models; multinomial outcomes; elastic net. Disclosure statement: No potential conflict of interest was reported by the author(s). Data availability statement: Code to reproduce the results of the simulation study and data analysis is available on GitHub (https://github.com/jmleach-bst/multinomial_ssnet_analyses). Note that while code for performing analysis on ADNI data is included, the ADNI data sets themselves are not, because we are not authorized to share data from ADNI. Details for access to these data can be found at http://adni.loni.usc.edu/data-samples/access-data/. Additional information. Funding: Data collection and sharing for this project was funded by the Alzheimer's Disease Neuroimaging Initiative (ADNI) (National Institutes of Health [grant number U01 AG024904]) and DOD ADNI (Department of Defense award number W81XWH-12-2-0012). ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: AbbVie, Alzheimer's Association; Alzheimer's Drug Discovery Foundation; Araclon Biotech; BioClinica, Inc.; Biogen; Bristol-Myers Squibb Company; CereSpir, Inc.; Cogstate; Eisai Inc.; Elan Pharmaceuticals, Inc.; Eli Lilly and Company; EuroImmun; F. 
Hoffmann-La Roche Ltd and its affiliated company Genentech, Inc.; Fujirebio; GE Healthcare; IXICO Ltd.; Janssen Alzheimer Immunotherapy Research & Develop","PeriodicalId":15239,"journal":{"name":"Journal of Applied Statistics","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134912325","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Prediction and model evaluation for space-time data.","authors":"G L Watson, C E Reid, M Jerrett, D Telesca","doi":"10.1080/02664763.2023.2252208","DOIUrl":"10.1080/02664763.2023.2252208","url":null,"abstract":"Evaluation metrics for prediction error, model selection and model averaging on space-time data are understudied and poorly understood. The absence of independent replication makes prediction ambiguous as a concept and renders evaluation procedures developed for independent data inappropriate for most space-time prediction problems. Motivated by air pollution data collected during California wildfires in 2008, this manuscript attempts a formalization of the true prediction error associated with spatial interpolation. We investigate a variety of cross-validation (CV) procedures employing both simulations and case studies to provide insight into the nature of the estimand targeted by alternative data partition strategies. Consistent with recent best practice, we find that location-based cross-validation is appropriate for estimating spatial interpolation error as in our analysis of the California wildfire data. 
Interestingly, commonly held notions of bias-variance trade-off of CV fold size do not trivially apply to dependent data, and we recommend leave-one-location-out (LOLO) CV as the preferred prediction error metric for spatial interpolation.","PeriodicalId":15239,"journal":{"name":"Journal of Applied Statistics","volume":"1 1","pages":"2007-2024"},"PeriodicalIF":1.2,"publicationDate":"2023-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11271132/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41565191","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Editorial to the special issue: statistical perspectives on analytics for COVID-19 data.","authors":"Arnold Stromberg, Jie Chen, Teresa Paula Costa Azinheira Oliveira, Yichuan Zhao, Ramin Moghaddass, Milan Stehlik","doi":"10.1080/02664763.2023.2228597","DOIUrl":"10.1080/02664763.2023.2228597","url":null,"abstract":"","PeriodicalId":15239,"journal":{"name":"Journal of Applied Statistics","volume":"50 11-12","pages":"2287-2293"},"PeriodicalIF":1.2,"publicationDate":"2023-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10388801/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10294239","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Finding groups in data: an introduction to cluster analysis","authors":"Soumita Modak","doi":"10.1080/02664763.2023.2220087","DOIUrl":"https://doi.org/10.1080/02664763.2023.2220087","url":null,"abstract":"","PeriodicalId":15239,"journal":{"name":"Journal of Applied Statistics","volume":" ","pages":""},"PeriodicalIF":1.5,"publicationDate":"2023-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44099860","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Combining phenotypic and genomic data to improve prediction of binary traits","authors":"D. Jarquin, A. Roy, B. Clarke, S. Ghosal","doi":"10.1080/02664763.2023.2208773","DOIUrl":"https://doi.org/10.1080/02664763.2023.2208773","url":null,"abstract":"Plant breeders want to develop cultivars that outperform existing genotypes. Some characteristics (here ‘main traits’) of these cultivars are categorical and difficult to measure directly. It is important to predict the main trait of newly developed genotypes accurately. In addition to marker data, breeding programs often have information on secondary traits (or ‘phenotypes’) that are easy to measure. Our goal is to improve prediction of main traits with interpretable relations by combining the two data types using variable selection techniques. However, the genomic characteristics can overwhelm the set of secondary traits, so a standard technique may fail to select any phenotypic variables. We develop a new statistical technique that ensures appropriate representation from both the secondary traits and the genotypic variables for optimal prediction. When two data types (markers and secondary traits) are available, we achieve improved prediction of a binary trait by two steps that are designed to ensure that a significant intrinsic effect of a phenotype is incorporated in the relation before accounting for extra effects of genotypes. First, we sparsely regress the secondary traits on the markers and replace the secondary traits by their residuals to obtain the effects of phenotypic variables as adjusted by the genotypic variables. Then, we develop a sparse logistic classifier using the markers and residuals so that the adjusted phenotypes may be selected first to avoid being overwhelmed by the genotypic variables due to their numerical advantage. This classifier uses forward selection aided by a penalty term and can be computed effectively by a technique called the one-pass method. 
It compares favorably with other classifiers on simulated and real data.","PeriodicalId":15239,"journal":{"name":"Journal of Applied Statistics","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136020907","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Knockoff procedure for false discovery rate control in high-dimensional data streams.","authors":"Ka Wai Tsang, Fugee Tsung, Zhihao Xu","doi":"10.1080/02664763.2023.2200496","DOIUrl":"10.1080/02664763.2023.2200496","url":null,"abstract":"Motivated by applications to root-cause identification of faults in high-dimensional data streams that may have very limited samples after faults are detected, we consider multiple testing in models for multivariate statistical process control (SPC). With quick fault detection, only a small portion of data streams being out-of-control (OC) can be assumed. It is a long-standing problem to identify those OC data streams while controlling the number of false discoveries. It is challenging due to the limited number of OC samples after the termination of the process when faults are detected. Although several false discovery rate (FDR) controlling methods have been proposed, people may prefer other methods for quick detection. With a recently developed method called Knockoff filtering, we propose a knockoff procedure that can combine with other fault detection methods in the sense that the knockoff procedure does not change the stopping time, but may identify another set of faults to control FDR. A theorem for the FDR control of the proposed procedure is provided. Simulation studies show that the proposed procedure can control FDR while maintaining high power. 
We also illustrate the performance in an application to semiconductor manufacturing processes that motivated this development.","PeriodicalId":15239,"journal":{"name":"Journal of Applied Statistics","volume":"50 14","pages":"2970-2983"},"PeriodicalIF":1.5,"publicationDate":"2023-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10557548/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41130200","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Detection and estimation of multiple transient changes.","authors":"Michael Baron, Sergey V Malov","doi":"10.1080/02664763.2023.2174257","DOIUrl":"10.1080/02664763.2023.2174257","url":null,"abstract":"Change-point detection methods are proposed for the case of temporary failures, or transient changes, when an unexpected disorder is ultimately followed by a re-adjustment and return to the initial state. A base distribution of the 'in-control' state changes to an 'out-of-control' distribution for unknown periods of time. Likelihood based sequential and retrospective tools are proposed for the detection and estimation of each pair of change-points. The accuracy of the obtained change-point estimates is assessed. Proposed methods offer simultaneous control of the familywise false alarm and false re-adjustment rates at the pre-chosen levels.","PeriodicalId":15239,"journal":{"name":"Journal of Applied Statistics","volume":"50 14","pages":"2862-2888"},"PeriodicalIF":1.5,"publicationDate":"2023-03-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10557625/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41132383","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Modeling the spatial patterns of antenatal care utilization in Nigeria with inference based on Pólya-Gamma mixtures.","authors":"Osafu Augustine Egbon, Ezra Gayawan","doi":"10.1080/02664763.2022.2164561","DOIUrl":"10.1080/02664763.2022.2164561","url":null,"abstract":"Despite the vast advantages of making antenatal care visits, the service utilization among pregnant women in Nigeria is suboptimal. A five-year monitoring estimate indicated that about 24% of the women who had live births made no visit. The non-utilization induced excessive zeroes in the outcome of interest. Thus, this study adopted a zero-inflated negative binomial model within a Bayesian framework to identify the spatial pattern and the key factors hindering antenatal care utilization in Nigeria. We overcome the intractability associated with posterior inference by adopting a Pólya-Gamma data-augmentation technique to facilitate inference. The Gibbs sampling algorithm was used to draw samples from the joint posterior distribution. Results revealed that type of place of residence, maternal level of education, access to mass media, household work index, and woman's working status have significant effects on the use of antenatal care services. Findings identified substantial state-level spatial disparity in antenatal care utilization across the country. Cost-effective techniques to achieve an acceptable frequency of utilization include the creation of a community-specific awareness to emphasize the importance and benefits of the appropriate utilization. 
Special consideration should be given to older pregnant women, women in poor antenatal utilization states, and women residing in poor road network regions.","PeriodicalId":15239,"journal":{"name":"Journal of Applied Statistics","volume":"1 1","pages":"866-890"},"PeriodicalIF":1.5,"publicationDate":"2023-02-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10956928/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41386591","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On regime changes of COVID-19 outbreak.","authors":"A Tchorbadjieff, L P Tomov, V Velev, G Dezhov, V Manev, P Mayster","doi":"10.1080/02664763.2023.2177625","DOIUrl":"10.1080/02664763.2023.2177625","url":null,"abstract":"The COVID-19 pandemic has had a very serious impact on societies and caused large-scale economic changes and death toll worldwide. The first cases were detected in China, but soon the virus spread quickly worldwide and the intensity of newly reported infections grew high during this initial period almost everywhere. Later, despite all imposed measures, the intensity shifted abruptly multiple times during the two-year period between 2020 and 2022 causing waves of too high infection rates in almost every part of the world. To target this problem, we assume the data heterogeneity as multiple consecutive regime changes. The research study includes the development of a model based on automatic regime change detection and their combination with the linear birth-death process for long-run data fits. The results are empirically verified on data for 38 countries and US states for the period from February 2020 to April 2022. Finally, the initial phase (conditions) properties of infection development are studied.","PeriodicalId":15239,"journal":{"name":"Journal of Applied Statistics","volume":"50 11-12","pages":"2343-2359"},"PeriodicalIF":1.2,"publicationDate":"2023-02-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10388815/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"9922918","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Smoothing regression and impact measures for accidents of traffic flows","authors":"Zhou Yu, Jie Yang, Hsin-Hsiung Huang","doi":"10.1080/02664763.2023.2175799","DOIUrl":"https://doi.org/10.1080/02664763.2023.2175799","url":null,"abstract":"Traffic pattern identification and accident evaluation are essential for improving traffic planning, road safety, and traffic management. In this paper, we establish classification and regression models to characterize the relationship between traffic flows and different time points and identify different patterns of traffic flows by a negative binomial model with smoothing splines. It provides mean response curves and Bayesian credible bands for traffic flows, a single index, and the log-likelihood difference, for traffic flow pattern recognition. We further propose an impact measure for evaluating the influence of accidents on traffic flows based on the fitted negative binomial model. The proposed method has been successfully applied to real-world traffic flows, and it can be used for improving traffic management.","PeriodicalId":15239,"journal":{"name":"Journal of Applied Statistics","volume":"299 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-02-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136097116","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}