{"title":"Bayesian decision theory for tree-based adaptive screening tests with an application to youth delinquency","authors":"Chelsea Krantsevich, P. Richard Hahn, Yi Zheng, Charles Katz","doi":"10.1214/22-aoas1657","DOIUrl":"https://doi.org/10.1214/22-aoas1657","url":null,"abstract":"Crime prevention strategies based on early intervention depend on accurate risk assessment instruments for identifying high-risk youth. It is important in this context that the instruments be convenient to administer, which means, in particular, that they should also be reasonably brief; adaptive screening tests are useful for this purpose. Adaptive tests constructed using classification and regression trees are becoming a popular alternative to traditional item response theory (IRT) approaches for adaptive testing. However, tree-based adaptive tests lack a principled criterion for terminating the test. This paper develops a Bayesian decision theory framework for measuring the trade-off between brevity and accuracy when considering tree-based adaptive screening tests of different lengths. We also present a novel method for designing tree-based adaptive tests, motivated by this framework. 
The framework and associated adaptive test method are demonstrated through an application to youth delinquency risk assessment in Honduras; it is shown that an adaptive test requiring a subject to answer fewer than 10 questions can identify high-risk youth nearly as accurately as an unabridged survey containing 173 items.","PeriodicalId":188068,"journal":{"name":"The Annals of Applied Statistics","volume":"68 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135219500","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
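The brevity/accuracy trade-off at the heart of this framework can be caricatured as maximizing an expected utility over candidate test lengths. The sketch below is only an illustration of that idea, not the paper's method; the utility form, the accuracy curve, and the per-question cost are all assumed.

```python
# Toy version of the brevity/accuracy trade-off: pick the test length L
# maximizing U(L) = accuracy(L) - cost_per_item * L. The accuracy curve and
# cost below are illustrative, not taken from the paper.

def best_length(accuracies, cost_per_item):
    """Return the 1-indexed test length maximizing accuracy minus length cost."""
    utilities = [acc - cost_per_item * (i + 1) for i, acc in enumerate(accuracies)]
    return max(range(len(utilities)), key=utilities.__getitem__) + 1

# Accuracy gains flatten as the test grows, so utility peaks at a short test.
accuracy_by_length = [0.60, 0.75, 0.82, 0.85, 0.86]
print(best_length(accuracy_by_length, cost_per_item=0.02))  # prints 4
```

With a concave accuracy curve and a linear length penalty, the optimum is an interior, short test — the qualitative behaviour the abstract reports for the sub-10-question adaptive test.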
{"title":"Surrogate marker assessment using mediation and instrumental variable analyses in a case-cohort design","authors":"Yen-Tsung Huang, Jih-Chang Yu, Jui-Hsiang Lin","doi":"10.1214/22-aoas1667","DOIUrl":"https://doi.org/10.1214/22-aoas1667","url":null,"abstract":"The identification of surrogate markers for gold-standard outcomes in clinical trials enables future cost-effective trials that target the identified markers. Due to resource limitations, these surrogate markers may be collected only for cases and for a subset of the trial cohort, giving rise to what is termed the case-cohort design. Motivated by a COVID-19 vaccine trial, we propose methods for assessing surrogate markers for a time-to-event outcome in a case-cohort design, using mediation and instrumental variable (IV) analyses. In the mediation analysis we decompose the vaccine effect on COVID-19 risk into an indirect effect (the effect mediated through the surrogate marker, such as neutralizing antibodies) and a direct effect (the effect not mediated by the marker), and we propose the mediation proportions as surrogacy indices. In the IV analysis we quantify the causal effect of the surrogate marker on disease risk in the presence of surrogate-disease confounding, which is unavoidable even in randomized trials. We employ weighted estimating equations derived from nonparametric maximum likelihood estimators (NPMLEs) under semiparametric probit models for the time-to-disease outcome. We plug in the weighted NPMLEs to construct estimators for the aforementioned causal effects and surrogacy indices, and we establish the asymptotic properties of the proposed estimators. Finite-sample performance is evaluated in numerical simulations. 
Applying the proposed mediation and IV analyses to mock COVID-19 vaccine trial data, we find that 84.2% of the vaccine efficacy was mediated by the 50% pseudovirus neutralizing antibody titer and that neutralizing antibodies had a significant protective effect against COVID-19 risk.","PeriodicalId":188068,"journal":{"name":"The Annals of Applied Statistics","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129688644","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
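The mediation-proportion idea reported above is, at its simplest, the share of the total effect carried by the mediator. A minimal sketch (the numbers are illustrative, chosen to echo the 84.2% figure; the actual decomposition in the paper is on the time-to-event scale and far more involved):

```python
def proportion_mediated(indirect, direct):
    """Share of the total effect carried by the surrogate marker,
    given indirect (mediated) and direct effects on a common scale."""
    return indirect / (indirect + direct)

# Illustrative effects: an indirect effect of 0.32 and a direct effect of 0.06
# give a mediation proportion of about 0.842, i.e., 84.2%.
print(round(proportion_mediated(0.32, 0.06), 3))  # prints 0.842
```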
{"title":"distinct: A novel approach to differential distribution analyses","authors":"Simone Tiberi, Helena L. Crowell, Pantelis Samartsidis, Lukas M. Weber, Mark D. Robinson","doi":"10.1214/22-aoas1689","DOIUrl":"https://doi.org/10.1214/22-aoas1689","url":null,"abstract":"We present distinct, a general method for differential analysis of full distributions that is well suited to applications on single-cell data, such as single-cell RNA sequencing and high-dimensional flow or mass cytometry data. High-throughput single-cell data reveal an unprecedented view of cell identity and allow complex variations between conditions to be discovered; nonetheless, most methods for differential expression target differences in the mean and struggle to identify changes where the mean is only marginally affected. distinct is based on a hierarchical nonparametric permutation approach and, by comparing empirical cumulative distribution functions, identifies both differential patterns involving changes in the mean and more subtle variations that do not involve the mean. We performed extensive benchmarks on both simulated and experimental single-cell RNA sequencing and mass cytometry datasets, in which distinct shows favourable performance, identifies more differential patterns than competing methods, and displays good control of false positive and false discovery rates. 
distinct is available as a Bioconductor R package.","PeriodicalId":188068,"journal":{"name":"The Annals of Applied Statistics","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135525103","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
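distinct's hierarchical nonparametric approach is more involved than this, but its core idea — comparing empirical CDFs between conditions and calibrating the observed difference by permutation — can be sketched in a few lines. All function names below are illustrative, not the package's API.

```python
import random

def ecdf_distance(x, y):
    """Maximum absolute difference between the two empirical CDFs
    (the Kolmogorov-Smirnov statistic)."""
    pooled = sorted(set(x) | set(y))
    nx, ny = len(x), len(y)
    return max(abs(sum(v <= t for v in x) / nx - sum(v <= t for v in y) / ny)
               for t in pooled)

def permutation_pvalue(x, y, n_perm=500, seed=0):
    """Permutation p-value for the ECDF distance between samples x and y:
    reshuffle group labels and count statistics at least as extreme."""
    rng = random.Random(seed)
    observed = ecdf_distance(x, y)
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if ecdf_distance(pooled[:len(x)], pooled[len(x):]) >= observed:
            hits += 1
    # add-one correction keeps the p-value away from exactly zero
    return (hits + 1) / (n_perm + 1)
```

Because the statistic looks at the whole ECDF rather than the mean, a shift in spread or shape with an unchanged mean can still register as significant — the kind of pattern the abstract says mean-based methods miss.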
{"title":"Estimation of Gaussian directed acyclic graphs using partial ordering information with applications to DREAM3 networks and dairy cattle data","authors":"S. Rahman, K. Khare, G. Michailidis, C. Martínez, J. Carulla","doi":"10.1214/22-aoas1636","DOIUrl":"https://doi.org/10.1214/22-aoas1636","url":null,"abstract":"Estimating a directed acyclic graph (DAG) from observational data represents a canonical learning problem and has attracted considerable interest in recent years. Research has focused mostly on the following two cases: when no information regarding the ordering of the nodes in the DAG is available, and when a domain-specific complete ordering of the nodes is available. In this paper, motivated by a recent application in dairy science, we develop a method for DAG estimation for the middle scenario, where a partition-based partial ordering of the nodes is known from domain-specific knowledge. We develop an efficient algorithm, coined Partition-DAG, that solves the posited problem. Through extensive simulations using the DREAM3 Yeast networks, we illustrate that Partition-DAG effectively incorporates the partial ordering information to improve both speed and accuracy. We then illustrate the usefulness of Partition-DAG by applying it to recently collected dairy cattle data, inferring relationships among the variables involved in dairy agroecosystems. Compared methods include Partition-DAG with two sets (ParDAG-2), the PC algorithm (PC), the stable PC algorithm (PC-STAB), the parallel PC algorithm (PC-PAR), and the PC algorithm with background-knowledge partition (PCBGK-2).","PeriodicalId":188068,"journal":{"name":"The Annals of Applied Statistics","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126768628","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
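The partial-ordering constraint that Partition-DAG exploits is easy to state: a node may receive edges only from nodes in its own or earlier partitions, with edge directions within a partition left to the data. A minimal sketch of the resulting search-space restriction (hypothetical helper, not the authors' code):

```python
def allowed_edges(partitions):
    """Edges permitted by a partition-based partial ordering: a node may have
    parents only in its own or earlier partitions. `partitions` is an ordered
    list of lists of node names."""
    rank = {v: k for k, part in enumerate(partitions) for v in part}
    nodes = [v for part in partitions for v in part]
    return {(u, v) for u in nodes for v in nodes
            if u != v and rank[u] <= rank[v]}
```

For example, with partitions `[["a"], ["b", "c"]]`, the edge `("a", "b")` is allowed but `("b", "a")` is not, while `("b", "c")` and `("c", "b")` both remain candidates — which is how partial ordering shrinks the DAG search space without fixing a complete node order.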
{"title":"An efficient doubly-robust imputation framework for longitudinal dropout, with an application to an Alzheimer’s clinical trial","authors":"Yuqi Qiu, Karen Messer","doi":"10.1214/23-AOAS1728","DOIUrl":"https://doi.org/10.1214/23-AOAS1728","url":null,"abstract":"We develop a novel doubly-robust (DR) imputation framework for longitudinal studies with monotone dropout, motivated by the informative dropout that is common in FDA-regulated trials for Alzheimer's disease. In this approach, the missing data are first imputed using a doubly-robust augmented inverse probability weighting (AIPW) estimator, then the imputed completed data are substituted into a full-data estimating equation, and the estimate is obtained using standard software. The imputed completed data may be inspected and compared to the observed data, and standard model diagnostics are available. The same imputed completed data can be used for several different estimands, such as subgroup analyses in a clinical trial, allowing for reduced computation and increased consistency across analyses. We present two specific DR imputation estimators, AIPW-I and AIPW-S, study their theoretical properties, and investigate their performance by simulation. AIPW-S has a substantially reduced computational burden compared to many other DR estimators, at the cost of some loss of efficiency and the requirement of stronger assumptions. Simulation studies support the theoretical properties and good performance of the DR imputation framework. Importantly, we demonstrate their ability to address time-varying covariates, such as a time-by-treatment interaction. 
We illustrate using data from a large randomized Phase III trial investigating the effect of donepezil in Alzheimer's disease, from the Alzheimer's Disease Cooperative Study (ADCS) group.","PeriodicalId":188068,"journal":{"name":"The Annals of Applied Statistics","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128427168","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
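The AIPW idea underlying the imputation step can be caricatured for a single outcome mean under missing-at-random dropout; the paper's longitudinal, monotone-dropout machinery is considerably richer, and the function below is only an illustrative sketch with assumed inputs (estimated observation probabilities and outcome-model predictions).

```python
def aipw_mean(y, observed, prop, pred):
    """AIPW estimate of an outcome mean with missing data.
    y[i] is used only when observed[i] is True; prop[i] is the estimated
    probability of being observed; pred[i] is the outcome-model prediction.
    The estimator is consistent if either the propensity model or the
    outcome model is correct -- the double-robustness property."""
    n = len(y)
    total = 0.0
    for i in range(n):
        r = 1.0 if observed[i] else 0.0
        total += r * y[i] / prop[i] + (1.0 - r / prop[i]) * pred[i]
    return total / n
```

When everyone is observed with probability one, the estimator reduces to the sample mean; for a missing subject the outcome-model prediction fills in, reweighted so that a correct propensity model alone also suffices.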
{"title":"A semiparametric promotion time cure model with support vector machine","authors":"S. Pal, Wisdom Aselisewine","doi":"10.1214/23-aoas1741","DOIUrl":"https://doi.org/10.1214/23-aoas1741","url":null,"abstract":"The promotion time cure rate model (PCM) is an extensively studied model for the analysis of time-to-event data in the presence of a cured subgroup. Several strategies have been proposed in the literature to model the latency part of the PCM. However, far fewer strategies have been proposed for investigating the effects of covariates on the incidence part of the PCM. In this regard, most existing studies assume that the boundary separating the cured and non-cured subjects with respect to the covariates is linear. As such, they can capture only simple effects of the covariates on the cured/non-cured probability. In this manuscript, we propose a new promotion time cure model that uses the support vector machine (SVM) to model the incidence part. The proposed model inherits the features of the SVM and provides flexibility in capturing non-linearity in the data. To the best of our knowledge, this is the first work that integrates the SVM with the PCM. For the estimation of model parameters, we develop an expectation-maximization algorithm in which we use the sequential minimal optimization technique together with the Platt scaling method to obtain the posterior probabilities of being cured or uncured. A detailed simulation study shows that the proposed model outperforms the existing logistic regression-based PCM as well as the spline regression-based PCM, which is also known to capture non-linearity in the data. This holds in terms of the bias and mean squared error of different quantities of interest, as well as the predictive and classification accuracy of cure. 
Finally, we illustrate the applicability and superiority of our model using data from a study of leukemia patients who underwent bone marrow transplantation.","PeriodicalId":188068,"journal":{"name":"The Annals of Applied Statistics","volume":"92 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133145524","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
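The Platt scaling step mentioned in this abstract maps raw SVM decision scores to probabilities through a fitted sigmoid. A minimal sketch — in practice the coefficients A and B are estimated by maximum likelihood on held-out decision scores; the defaults here are placeholders:

```python
import math

def platt(score, A=-1.5, B=0.0):
    """Platt scaling: convert an SVM decision score into a probability via
    the sigmoid 1 / (1 + exp(A * score + B)). A negative A makes larger
    scores map to larger probabilities; A and B here are illustrative."""
    return 1.0 / (1.0 + math.exp(A * score + B))

print(platt(0.0))  # prints 0.5 when B = 0
```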
{"title":"A multivariate frequency-severity framework for healthcare data breaches","authors":"Hong Sun, Maochao Xu, P. Zhao","doi":"10.1214/22-aoas1625","DOIUrl":"https://doi.org/10.1214/22-aoas1625","url":null,"abstract":"Data breaches in healthcare have become a substantial concern in recent years, and cause millions of dollars in financial losses each year. It is fundamental for government regulators, insurance companies, and stakeholders to understand the breach frequency and the number of affected individuals in each state, as these are directly related to the federal Health Insurance Portability and Accountability Act (HIPAA) and state data breach laws. However, an obstacle to studying data breaches in healthcare is the lack of suitable statistical approaches. We develop a novel multivariate frequency-severity framework to analyze breach frequency and the number of affected individuals at the state level. A mixed effects model is developed to model the square root transformed frequency, and the log-gamma distribution is proposed to capture the skewness and heavy tail exhibited by the distribution of numbers of affected individuals. We further discover a positive nonlinear dependence between the transformed frequency and the log-transformed numbers of affected individuals (i.e., severity). In particular, we propose to use a D-vine copula to capture the multivariate dependence among conditional severities given frequencies due to its inherent temporal structure and rich bivariate copula families. A rejection sampling technique is developed to simulate the predictive distributions. 
Both in-sample and out-of-sample studies show that the proposed multivariate frequency-severity model, which accommodates nonlinear dependence, has satisfactory fitting and prediction performance.","PeriodicalId":188068,"journal":{"name":"The Annals of Applied Statistics","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130424091","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
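A stripped-down frequency-severity simulation conveys the structure of such models: draw a breach count, then draw a log-gamma number of affected individuals per breach, and aggregate. The sketch below assumes independence between frequency and severity — precisely the simplification the paper's D-vine copula removes — and uses a plain Poisson frequency rather than the paper's mixed effects model; all parameters are illustrative.

```python
import math
import random

def simulate_total_affected(mean_freq, shape, scale, seed=0):
    """Toy frequency-severity draw: Poisson breach count, then log-gamma
    severities (affected individuals per breach), summed. Independence
    between frequency and severity is assumed here for simplicity."""
    rng = random.Random(seed)
    # Poisson draw by CDF inversion (adequate for small means)
    u, k = rng.random(), 0
    p = math.exp(-mean_freq)
    cum = p
    while u > cum:
        k += 1
        p *= mean_freq / k
        cum += p
    # log-gamma severity: the exponential of a gamma variate is heavy-tailed,
    # matching the skewness the abstract describes
    return sum(math.exp(rng.gammavariate(shape, scale)) for _ in range(k))
```

Replacing the independent draws with copula-linked ones is what turns this toy into a frequency-severity model of the kind the paper proposes.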
{"title":"Time-discretization approximation enriches continuous-time discrete-space models for animal movement","authors":"Joshua Hewitt, A. Gelfand, R. Schick","doi":"10.1214/22-aoas1649","DOIUrl":"https://doi.org/10.1214/22-aoas1649","url":null,"abstract":"","PeriodicalId":188068,"journal":{"name":"The Annals of Applied Statistics","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123770972","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Handling categorical features with many levels using a product partition model","authors":"Tulio L. Criscuolo, R. Assunção, R. Loschi, W. Meira Jr., D. Cruz-Reyes","doi":"10.1214/22-aoas1651","DOIUrl":"https://doi.org/10.1214/22-aoas1651","url":null,"abstract":"","PeriodicalId":188068,"journal":{"name":"The Annals of Applied Statistics","volume":"260 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122755171","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Estimating the average treatment effect in randomized clinical trials with all-or-none compliance","authors":"Zhiwei Zhang, Zonghui Hu, D. Follmann, L. Nie","doi":"10.1214/22-aoas1627","DOIUrl":"https://doi.org/10.1214/22-aoas1627","url":null,"abstract":"Noncompliance is a common intercurrent event in randomized clinical trials that raises important questions about analytical objectives and approaches. Motivated by the Multiple Risk Factor Intervention Trial (MRFIT), we consider how to estimate the average treatment effect (ATE) in randomized trials with all-or-none compliance. Confounding is a major challenge in estimating the ATE, and conventional methods for confounding adjustment typically require the assumption of no unmeasured confounders, which may be difficult to justify. Using randomized treatment assignment as an instrumental variable, the ATE can be identified in the presence of unmeasured confounders under suitable assumptions, including an assumption that limits the effect-modifying activities of unmeasured confounders. We describe and compare several estimation methods based on different modeling assumptions. Some of these methods are able to incorporate information from auxiliary covariates for improved efficiency without introducing bias. The different methods are compared in a simulation study and applied to the MRFIT.","PeriodicalId":188068,"journal":{"name":"The Annals of Applied Statistics","volume":"54 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125113338","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
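With randomized assignment as the instrument, the simplest estimator in this family is the Wald ratio: the intention-to-treat (ITT) effect on the outcome divided by the ITT effect on treatment receipt. It identifies the complier average causal effect, which coincides with the ATE only under additional homogeneity-type assumptions of the kind this abstract alludes to. The sketch below is the generic textbook estimator, not the authors' covariate-assisted methods.

```python
def wald_iv_estimate(y_treat_arm, y_control_arm, d_treat_arm, d_control_arm):
    """Wald IV estimator with randomization as the instrument.
    y_* are outcomes and d_* are treatment-received indicators (0/1),
    split by randomized arm. Returns ITT effect on the outcome divided
    by ITT effect on treatment receipt."""
    mean = lambda v: sum(v) / len(v)
    itt_y = mean(y_treat_arm) - mean(y_control_arm)
    itt_d = mean(d_treat_arm) - mean(d_control_arm)
    return itt_y / itt_d
```

With perfect compliance (`itt_d == 1`) the ratio reduces to the ordinary difference in arm means; all-or-none noncompliance shrinks the denominator and scales the ITT effect up accordingly.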