{"title":"A dynamic screening algorithm for hierarchical binary marketing data","authors":"Yimei Fan, Yuan Liao, I. Ryzhov, Kunpeng Zhang","doi":"10.1214/22-aoas1720","DOIUrl":"https://doi.org/10.1214/22-aoas1720","url":null,"abstract":"In many applications of business and marketing analytics, predictive models are fit using hierarchically structured data: common characteristics of products, customers, or webpages are represented as categorical variables, and each category can be split up into multiple subcategories at a lower level of the hierarchy. The model may thus contain hundreds of thousands of binary variables, necessitating the use of variable selection to screen out large numbers of irrelevant or insignificant features. We propose a new dynamic screening method, based on the distance correlation criterion, designed for hierarchical binary data. Our method can screen out large parts of the hierarchy at the higher levels, avoiding the need to explore many lower-level features and greatly reducing the computational cost of screening. The practical potential of the method is demonstrated in a case application on user-brand interaction data from Facebook.","PeriodicalId":188068,"journal":{"name":"The Annals of Applied Statistics","volume":" 24","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114051769","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Colman Humphrey, Ryan Gross, Dylan S. Small, Shane T. Jensen
{"title":"Using predictability to improve matching of urban locations in Philadelphia","authors":"Colman Humphrey, Ryan Gross, Dylan S. Small, Shane T. Jensen","doi":"10.1214/23-aoas1739","DOIUrl":"https://doi.org/10.1214/23-aoas1739","url":null,"abstract":"","PeriodicalId":188068,"journal":{"name":"The Annals of Applied Statistics","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130561784","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Graham C. Gibson, Nicholas G. Reich, Daniel Sheldon
{"title":"Real-time mechanistic Bayesian forecasts of COVID-19 mortality","authors":"Graham C. Gibson, Nicholas G. Reich, Daniel Sheldon","doi":"10.1214/22-aoas1671","DOIUrl":"https://doi.org/10.1214/22-aoas1671","url":null,"abstract":"The COVID-19 pandemic emerged in late December 2019. In the first six months of the global outbreak, the U.S. reported more cases and deaths than any other country in the world. Effective modeling of the course of the pandemic can help assist with public health resource planning, intervention efforts, and vaccine clinical trials. However, building applied forecasting models presents unique challenges during a pandemic. First, case data available to models in real time represent a nonstationary fraction of the true case incidence due to changes in available diagnostic tests and test-seeking behavior. Second, interventions varied across time and geography leading to large changes in transmissibility over the course of the pandemic. We propose a mechanistic Bayesian model that builds upon the classic compartmental susceptible–exposed–infected–recovered (SEIR) model to operationalize COVID-19 forecasting in real time. This framework includes nonparametric modeling of varying transmission rates, nonparametric modeling of case and death discrepancies due to testing and reporting issues, and a joint observation likelihood on new case counts and new deaths; it is implemented in a probabilistic programming language to automate the use of Bayesian reasoning for quantifying uncertainty in probabilistic forecasts. The model has been used to submit forecasts to the U.S. Centers for Disease Control through the COVID-19 Forecast Hub under the name MechBayes. We examine the performance relative to a baseline model as well as alternate models submitted to the forecast hub. Additionally, we include an ablation test of our extensions to the classic SEIR model. 
We demonstrate a significant gain in both point and probabilistic forecast scoring measures using MechBayes, when compared to a baseline model, and show that MechBayes ranks as one of the top two models out of nine which regularly submitted to the COVID-19 Forecast Hub for the duration of the pandemic, trailing only the COVID-19 Forecast Hub ensemble model of which MechBayes is a part.","PeriodicalId":188068,"journal":{"name":"The Annals of Applied Statistics","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135200733","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Melody Y. Huang, Naoki Egami, E. Hartman, Luke W. Miratrix
{"title":"Leveraging population outcomes to improve the generalization of experimental results: Application to the JTPA study","authors":"Melody Y. Huang, Naoki Egami, E. Hartman, Luke W. Miratrix","doi":"10.1214/22-aoas1712","DOIUrl":"https://doi.org/10.1214/22-aoas1712","url":null,"abstract":"Generalizing causal estimates in randomized experiments to a broader target population is essential for guiding decisions by policymakers and practitioners in the social and biomedical sciences. While recent papers developed various weighting estimators for the population average treatment effect (PATE), many of these methods result in large variance because the experimental sample often differs substantially from the target population, and estimated sampling weights are extreme. We investigate this practical problem motivated by an evaluation study of the Job Training Partnership Act (JTPA), where we examine how well we can generalize the causal effect of job training programs beyond a specific population of economically disadvantaged adults and youths. In particular, we propose post-residualized weighting in which we use the outcome measured in the observational population data to build a flexible predictive model (e.g., machine learning methods) and residualize the outcome in the experimental data before using conventional weighting methods. We show that the proposed PATE estimator is consistent under the same assumptions required for existing weighting methods, importantly without assuming the correct specification of the predictive model. 
We demonstrate the efficiency gains from this approach through our JTPA application: we find a reduction in variance of between 5% and 25%.","PeriodicalId":188068,"journal":{"name":"The Annals of Applied Statistics","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114606887","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Bayesian model selection: Application to the adjustment of fundamental physical constants","authors":"Olha Bodnar, V. Eriksson","doi":"10.1214/22-aoas1710","DOIUrl":"https://doi.org/10.1214/22-aoas1710","url":null,"abstract":"A method originally suggested by Raymond Birge, using what came to be known as the Birge ratio, has been widely used in metrology and physics for the adjustment of fundamental physical constants, particularly in the periodic reevaluation carried out by the Task Group on Fundamental Physical Constants of CODATA (the Committee on Data of the International Science Council). The method involves increasing the reported uncertainties by a multiplicative factor large enough to make the measurement results mutually consistent. An alternative approach, predominant in the meta-analysis of medical studies, involves inflating the reported uncertainties by combining them, using the root sum of squares, with a sufficiently large constant (often dubbed dark uncertainty) that is estimated from the data. In this contribution, we establish a connection between the method based on the Birge ratio and the location-scale model, which allows one to combine the results of various studies, while the additive adjustment is reviewed in the usual context of random effects models. Framing these alternative approaches as statistical models facilitates a quantitative comparison of them using statistical tools for model comparison. The intrinsic Bayes factor (IBF) is derived for the Berger and Bernardo reference prior, and then it is used to select a model for a set of measurements of the Newtonian constant of gravitation (“Big G”) to estimate a consensus value for this constant and to evaluate the associated uncertainty. Our empirical findings support the method based on the Birge ratio. The same conclusion is reached when the IBF corresponding to the Jeffreys prior is used and also when the comparison is based on the Akaike information criterion (AIC). 
Finally, the results of a simulation study indicate that the suggested procedure for model selection provides clear guidance even when the data comprise only a small number of measurements.","PeriodicalId":188068,"journal":{"name":"The Annals of Applied Statistics","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124895458","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Signal-noise ratio of genetic associations and statistical power of SNP-set tests","authors":"Hong Zhang, Ming-Te Liu, Jiashun Jin, Zheyang Wu","doi":"10.1214/22-aoas1725","DOIUrl":"https://doi.org/10.1214/22-aoas1725","url":null,"abstract":"The SNP-set analysis is a powerful tool for dissecting the genetics of complex human diseases. There are three fundamental genetic association approaches to SNP-set analysis: the marginal model fitting approach, the joint model fitting approach, and the decorrelation approach. A problem of primary interest is how these approaches compare with each other. To address this problem, we develop a theoretical platform to compare the signal-to-noise ratio (SNR) of these approaches under the generalized linear model. We elaborate how causal genetic effects give rise to statistically detectable association signals, and show that when causal effects spread over blocks of strong linkage disequilibrium (LD), the SNR of the marginal model fitting is usually higher than that of the decorrelation approach, which in turn is higher than that of the unbiased joint model fitting approach. We also scrutinize dense effects and LDs by a bivariate model and extensive simulations using the 1000 Genomes Project data. Last, we compare the statistical power of two generic types of SNP-set tests (summation-based and supremum-based) by simulations and an osteoporosis study using large data from UK Biobank. 
Our results help develop powerful tools for SNP-set analysis and understand the signal detection problem in the presence of colored noise.","PeriodicalId":188068,"journal":{"name":"The Annals of Applied Statistics","volume":"54 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121864935","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Trambak Banerjee, Peng Liu, Gourab Mukherjee, Shantanu Dutta, Hai Che
{"title":"Joint modeling of playing time and purchase propensity in massively multiplayer online role-playing games using crossed random effects","authors":"Trambak Banerjee, Peng Liu, Gourab Mukherjee, Shantanu Dutta, Hai Che","doi":"10.1214/23-aoas1731","DOIUrl":"https://doi.org/10.1214/23-aoas1731","url":null,"abstract":"Massively Multiplayer Online Role Playing Games (MMORPGs) offer a unique blend of a personalized gaming experience and a platform for forging social connections. Managers of these digital products usually rely on predictions of key player responses, such as playing time and purchase propensity, to design timely interventions for promoting, engaging and monetizing their playing base. However, the longitudinal data associated with these MMORPGs not only exhibit a large set of potential predictors to choose from but often present several other distinctive characteristics that pose significant challenges in developing flexible statistical algorithms that can generate efficient predictions of future player activities. For instance, the existence of virtual communities or guilds in these games complicates prediction since players who are part of the same guild have correlated behaviors and the guilds themselves evolve over time and, thus, have a dynamic effect on the future playing behavior of their members. In this paper, we develop a Crossed Random Effects Joint Modeling (CREJM) framework for analyzing correlated player responses in MMORPGs. Contrary to existing methods that assume player independence, CREJM is flexible enough to incorporate both player dependence as well as time varying guild effects on the future playing behavior of the guild members. On large-scale data from a popular MMORPG, CREJM conducts simultaneous selection of fixed and random effects in high-dimensional penalized multivariate mixed models. We study the asymptotic properties of the variable selection procedure in CREJM and establish its selection consistency. 
Besides providing superior predictions of daily playing time and purchase propensity over competing methods, CREJM also predicts player correlations within each guild which are valuable for optimizing future promotional and reward policies for these virtual communities.","PeriodicalId":188068,"journal":{"name":"The Annals of Applied Statistics","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130804053","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dynamic prediction of residual life with longitudinal covariates using long short-term memory networks","authors":"Grace Rhodes, Marie Davidian, Wenbin Lu","doi":"10.1214/22-aoas1706","DOIUrl":"https://doi.org/10.1214/22-aoas1706","url":null,"abstract":"Sepsis, a complex medical condition that involves severe infections with life-threatening organ dysfunction, is a leading cause of death worldwide. Treatment of sepsis is highly challenging. When making treatment decisions, clinicians and patients desire accurate predictions of mean residual life (MRL) that leverage all available patient information, including longitudinal biomarker data. Biomarkers are biological, clinical, and other variables reflecting disease progression that are often measured repeatedly on patients in the clinical setting. Dynamic prediction methods leverage accruing biomarker measurements to improve performance, providing updated predictions as new measurements become available. We introduce two methods for dynamic prediction of MRL using longitudinal biomarkers. In both methods, we begin by using long short-term memory networks (LSTMs) to construct encoded representations of the biomarker trajectories, referred to as “context vectors.” In our first method, the LSTM-GLM, we dynamically predict MRL via a transformed MRL model that includes the context vectors as covariates. In our second method, the LSTM-NN, we dynamically predict MRL from the context vectors using a feed-forward neural network. We demonstrate the improved performance of both proposed methods relative to competing methods in simulation studies. We apply the proposed methods to dynamically predict the restricted mean residual life (RMRL) of septic patients in the intensive care unit using electronic medical record data. 
We demonstrate that the LSTM-GLM and the LSTM-NN are useful tools for producing individualized, real-time predictions of RMRL that can help inform the treatment decisions of septic patients.","PeriodicalId":188068,"journal":{"name":"The Annals of Applied Statistics","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116898975","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Rebecca Anthopolos, Qixuan Chen, Joseph Sedransk, Mary Thompson, Gang Meng, Sandro Galea
{"title":"A Bayesian growth mixture model for complex survey data: Clustering postdisaster PTSD trajectories","authors":"Rebecca Anthopolos, Qixuan Chen, Joseph Sedransk, Mary Thompson, Gang Meng, Sandro Galea","doi":"10.1214/23-aoas1729","DOIUrl":"https://doi.org/10.1214/23-aoas1729","url":null,"abstract":"","PeriodicalId":188068,"journal":{"name":"The Annals of Applied Statistics","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115457130","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Postelection analysis of presidential election/poll data","authors":"Jiming Jiang, Yuanyuan Li, Peter X. K. Song","doi":"10.1214/22-aoas1707","DOIUrl":"https://doi.org/10.1214/22-aoas1707","url":null,"abstract":"This paper concerns analyses of the 2016 and 2020 U.S. presidential election data, including the data of pre-election polls and the actual elections. Our analyses unveil statistical evidence of discrepancy between the polls and real elections that is consistent across these two elections. Specifically, the polls had consistently over-estimated advantages of the Democratic candidates, or, equivalently, under-estimated the true population support of the Republican candidate, Donald Trump, in both elections. The analyses are stratified by state, reflecting the U.S. electoral college system, by means of small area estimation. We have found recurrent patterns suggesting that the polls have been underestimating the Republican candidate, especially in swing states of critical importance. Our findings also suggest an improvement of the 2020 polling methods to mitigate the size of underestimation. We show that a small-area model built upon the actual election data from one election can provide a better prediction than the poll-based projection to another election involving the same Republican candidate. 
Ranking of pollsters based on prediction bias using mixed model prediction is also considered.","PeriodicalId":188068,"journal":{"name":"The Annals of Applied Statistics","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126622214","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}