{"title":"Bayesian Semiparametric Joint Regression Analysis of Recurrent Adverse Events and Survival in Esophageal Cancer Patients.","authors":"Juhee Lee, Peter F Thall, Steven H Lin","doi":"10.1214/18-AOAS1182","DOIUrl":"10.1214/18-AOAS1182","url":null,"abstract":"<p><p>We propose a Bayesian semiparametric joint regression model for a recurrent event process and survival time. Assuming independent latent subject frailties, we define marginal models for the recurrent event process intensity and survival distribution as functions of the subject's frailty and baseline covariates. A robust Bayesian model, called Joint-DP, is obtained by assuming a Dirichlet process for the frailty distribution. We present a simulation study that compares posterior estimates under the Joint-DP model to a Bayesian joint model with lognormal frailties, a frequentist joint model, and marginal models for either the recurrent event process or survival time. The simulations show that the Joint-DP model does a good job of correcting for treatment assignment bias, and has favorable estimation reliability and accuracy compared with the alternative models. 
The Joint-DP model is applied to analyze an observational dataset from esophageal cancer patients treated with chemo-radiation, including the times of recurrent effusions of fluid around the heart or lungs, survival time, prognostic covariates, and radiation therapy modality.</p>","PeriodicalId":50772,"journal":{"name":"Annals of Applied Statistics","volume":"13 1","pages":"221-247"},"PeriodicalIF":1.3,"publicationDate":"2019-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6824476/pdf/nihms969597.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41219382","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
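The Dirichlet process frailty prior at the heart of the Joint-DP model above can be illustrated with a truncated stick-breaking construction. This is a generic sketch of a DP draw, not the authors' posterior sampler; the concentration parameter `alpha = 2.0` and the standard-normal base measure are illustrative assumptions.

```python
import numpy as np

def stick_breaking_dp(alpha, base_sampler, truncation, rng):
    """Draw a truncated stick-breaking approximation to a Dirichlet process.

    Returns (atoms, weights): candidate frailty values and their mixing weights.
    """
    betas = rng.beta(1.0, alpha, size=truncation)
    # Weight k is beta_k times the stick mass remaining after the first k-1 breaks.
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - betas)[:-1]])
    weights = betas * remaining
    atoms = base_sampler(truncation)
    return atoms, weights

rng = np.random.default_rng(0)
atoms, weights = stick_breaking_dp(
    alpha=2.0,
    base_sampler=lambda n: rng.normal(0.0, 1.0, size=n),  # base measure G0 = N(0, 1)
    truncation=500,
    rng=rng,
)
# At a generous truncation level, the weights sum to essentially 1,
# giving a discrete random distribution over frailty values.
```

A draw like this places all mass on finitely many atoms, which is what lets the DP prior cluster subjects with similar frailties rather than forcing a single parametric (e.g., lognormal) frailty shape.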
{"title":"JOINT MEAN AND COVARIANCE MODELING OF MULTIPLE HEALTH OUTCOME MEASURES.","authors":"Xiaoyue Niu, Peter D Hoff","doi":"10.1214/18-AOAS1187","DOIUrl":"https://doi.org/10.1214/18-AOAS1187","url":null,"abstract":"<p><p>Health exams determine a patient's health status by comparing the patient's measurement with a population reference range, a 95% interval derived from a homogeneous reference population. Similarly, most of the established relations among health problems are assumed to hold for the entire population. We use data from the 2009-2010 National Health and Nutrition Examination Survey (NHANES) on four major health problems in the U.S. and apply a joint mean and covariance model to study how the reference ranges and associations of those health outcomes could vary among subpopulations. We discuss guidelines for model selection and evaluation, using standard criteria such as AIC in conjunction with posterior predictive checks. The results from the proposed model can help identify subpopulations in which more data need to be collected to refine the reference range and to study the specific associations among those health problems.</p>","PeriodicalId":50772,"journal":{"name":"Annals of Applied Statistics","volume":"13 1","pages":"321-339"},"PeriodicalIF":1.8,"publicationDate":"2019-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1214/18-AOAS1187","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41219384","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CAUSAL INFERENCE IN THE CONTEXT OF AN ERROR PRONE EXPOSURE: AIR POLLUTION AND MORTALITY.","authors":"Xiao Wu, Danielle Braun, Marianthi-Anna Kioumourtzoglou, Christine Choirat, Qian Di, Francesca Dominici","doi":"10.1214/18-AOAS1206","DOIUrl":"https://doi.org/10.1214/18-AOAS1206","url":null,"abstract":"<p><p>We propose a new approach for estimating causal effects when the exposure is measured with error and confounding adjustment is performed via a generalized propensity score (GPS). Using validation data, we propose a regression calibration (RC)-based adjustment for a continuous error-prone exposure combined with GPS to adjust for confounding (RC-GPS). The outcome analysis is conducted after transforming the corrected continuous exposure into a categorical exposure. We consider confounding adjustment in the context of GPS subclassification, inverse probability treatment weighting (IPTW) and matching. In simulations with varying degrees of exposure error and confounding bias, RC-GPS eliminates bias from exposure error and confounding compared to standard approaches that rely on the error-prone exposure. We applied RC-GPS to a rich data platform to estimate the causal effect of long-term exposure to fine particles (PM<sub>2.5</sub>) on mortality in New England for the period from 2000 to 2012. The main study consists of 2202 zip codes covered by 217,660 1 km × 1 km grid cells with yearly mortality rates, yearly PM<sub>2.5</sub> averages estimated from a spatio-temporal model (error-prone exposure) and several potential confounders. The internal validation study includes a subset of 83 1 km × 1 km grid cells within 75 zip codes from the main study with error-free yearly PM<sub>2.5</sub> exposures obtained from monitor stations. 
Under assumptions of noninterference and weak unconfoundedness, using matching we found that exposure to moderate levels of PM<sub>2.5</sub> (8 < PM<sub>2.5</sub> ≤ 10 <i>μ</i>g/m<sup>3</sup>) causes a 2.8% (95% CI: 0.6%, 3.6%) increase in all-cause mortality compared to low exposure (PM<sub>2.5</sub> ≤ 8 <i>μ</i>g/m<sup>3</sup>).</p>","PeriodicalId":50772,"journal":{"name":"Annals of Applied Statistics","volume":"13 1","pages":"520-547"},"PeriodicalIF":1.8,"publicationDate":"2019-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1214/18-AOAS1206","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41219383","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
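The regression-calibration step in RC-GPS can be sketched as follows. This is a minimal linear-calibration illustration on simulated data under classical measurement error, not the authors' full pipeline (which combines the calibrated exposure with a generalized propensity score for confounding adjustment); all sample sizes and error variances below are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated "true" exposure and an error-prone measurement of it.
n_valid, n_main = 200, 2000
x_true_valid = rng.normal(10.0, 2.0, n_valid)
x_err_valid = x_true_valid + rng.normal(0.0, 1.0, n_valid)   # validation pairs
x_err_main = rng.normal(10.0, 2.0, n_main) + rng.normal(0.0, 1.0, n_main)

# Regression calibration: regress the true exposure on the error-prone one
# in the validation data, then predict E[X | X*] in the main study.
X = np.column_stack([np.ones(n_valid), x_err_valid])
coef, *_ = np.linalg.lstsq(X, x_true_valid, rcond=None)
x_calibrated_main = coef[0] + coef[1] * x_err_main

# The calibrated exposure would then be categorized (e.g., <= 8, 8-10 ug/m^3)
# before confounding adjustment via GPS subclassification, IPTW, or matching.
```

Under classical error the calibration slope shrinks toward var(X) / (var(X) + var(U)), here about 0.8, which is what corrects the attenuation that a naive analysis of the error-prone exposure would suffer.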
{"title":"MULTILAYER KNOCKOFF FILTER: CONTROLLED VARIABLE SELECTION AT MULTIPLE RESOLUTIONS.","authors":"Eugene Katsevich, Chiara Sabatti","doi":"10.1214/18-AOAS1185","DOIUrl":"https://doi.org/10.1214/18-AOAS1185","url":null,"abstract":"<p><p>We tackle the problem of selecting from among a large number of variables those that are \"important\" for an outcome. We consider situations where groups of variables are also of interest. For example, each variable might be a genetic polymorphism, and we might want to study how a trait depends on variability in genes, segments of DNA that typically contain multiple such polymorphisms. In this context, to discover that a variable is relevant for the outcome implies discovering that the larger entity it represents is also important. To guarantee meaningful results with high chance of replicability, we suggest controlling the rate of false discoveries for findings at the level of individual variables and at the level of groups. Building on the knockoff construction of Barber and Candès [<i>Ann. Statist.</i> <b>43</b> (2015) 2055-2085] and the multilayer testing framework of Barber and Ramdas [<i>J. Roy. Statist. Soc. Ser. B</i> <b>79</b> (2017) 1247-1268], we introduce the multilayer knockoff filter (MKF). We prove that MKF simultaneously controls the FDR at each resolution and use simulations to show that it incurs little power loss compared to methods that provide guarantees only for the discoveries of individual variables. 
We apply MKF to analyze a genetic dataset and find that it successfully reduces the number of false gene discoveries without a significant reduction in power.</p>","PeriodicalId":50772,"journal":{"name":"Annals of Applied Statistics","volume":"13 1","pages":"1-33"},"PeriodicalIF":1.8,"publicationDate":"2019-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1214/18-AOAS1185","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41219385","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
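The selection rule underlying MKF can be illustrated with the single-layer knockoff+ threshold of Barber and Candès: given feature statistics W_j (large positive values indicating evidence for a true signal), select the variables whose W_j exceeds a data-dependent threshold. This sketch shows only that generic thresholding step, not the multilayer filter itself; the toy statistics are invented for the example.

```python
import numpy as np

def knockoff_plus_threshold(W, q):
    """Knockoff+ threshold: smallest t with (1 + #{W_j <= -t}) / #{W_j >= t} <= q."""
    candidates = np.sort(np.abs(W[W != 0]))
    for t in candidates:
        fdp_hat = (1 + np.sum(W <= -t)) / max(1, np.sum(W >= t))
        if fdp_hat <= q:
            return t
    return np.inf  # no threshold meets the target FDR; select nothing

W = np.array([5.0, 4.2, 3.9, 3.5, 3.1, -0.4, 0.7, -2.9, 0.2, 1.1])
t = knockoff_plus_threshold(W, q=0.2)
selected = np.where(W >= t)[0]  # indices of selected variables
```

The negative statistics act as an internal estimate of false discoveries; MKF applies a coordinated version of this rule at each resolution (individual variables and groups) simultaneously.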
{"title":"BAYESIAN ANALYSIS OF INFANT'S GROWTH DYNAMICS WITH <i>IN UTERO</i> EXPOSURE TO ENVIRONMENTAL TOXICANTS.","authors":"Jonggyu Baek, Bin Zhu, Peter X K Song","doi":"10.1214/18-aoas1199","DOIUrl":"10.1214/18-aoas1199","url":null,"abstract":"<p><p>Early infancy from at-birth to 3 years is critical for cognitive, emotional and social development of infants. During this period, an infant's developmental tempo and outcomes are potentially impacted by <i>in utero</i> exposure to endocrine disrupting compounds (EDCs), such as bisphenol A (BPA) and phthalates. We investigate the effects of ten ubiquitous EDCs on the infant growth dynamics of body mass index (BMI) in a birth cohort study. Modeling growth acceleration is proposed to understand the \"force of growth\" through a class of semiparametric stochastic velocity models. The great flexibility of such a dynamic model enables us to capture subject-specific dynamics of growth trajectories and to assess effects of the EDCs on potential delay of growth. We adopted a Bayesian method with the Ornstein-Uhlenbeck process as the prior for the growth rate function, in which the World Health Organization's global infant growth curves were integrated into our analysis. We found that exposure to BPA and to most of the phthalates during the first trimester of pregnancy was inversely associated with BMI growth acceleration, resulting in delayed achievement of the infant BMI peak. 
Such early growth deficiency has been reported to have a profound impact on health outcomes in puberty (e.g., timing of sexual maturation) and adulthood.</p>","PeriodicalId":50772,"journal":{"name":"Annals of Applied Statistics","volume":"13 1","pages":"297-320"},"PeriodicalIF":1.8,"publicationDate":"2019-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10617987/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"71428742","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"EXACT SPIKE TRAIN INFERENCE VIA ℓ<sub>0</sub> OPTIMIZATION.","authors":"Sean Jewell, Daniela Witten","doi":"10.1214/18-AOAS1162","DOIUrl":"10.1214/18-AOAS1162","url":null,"abstract":"<p><p>In recent years new technologies in neuroscience have made it possible to measure the activities of large numbers of neurons simultaneously in behaving animals. For each neuron a <i>fluorescence trace</i> is measured; this can be seen as a first-order approximation of the neuron's activity over time. Determining the exact time at which a neuron spikes on the basis of its fluorescence trace is an important open problem in the field of computational neuroscience. Recently, a convex optimization problem involving an ℓ<sub>1</sub> penalty was proposed for this task. In this paper we slightly modify that recent proposal by replacing the ℓ<sub>1</sub> penalty with an ℓ<sub>0</sub> penalty. In stark contrast to the conventional wisdom that ℓ<sub>0</sub> optimization problems are computationally intractable, we show that the resulting optimization problem can be efficiently solved for the global optimum using an extremely simple and efficient dynamic programming algorithm. Our R-language implementation of the proposed algorithm runs in a few minutes on fluorescence traces of 100,000 timesteps. Furthermore, our proposal leads to substantial improvements over the previous ℓ<sub>1</sub> proposal, in simulations as well as on two calcium imaging datasets. R-language software for our proposal is available on CRAN in the package LZeroSpikeInference. 
Instructions for running this software in Python can be found at https://github.com/jewellsean/LZeroSpikeInference.</p>","PeriodicalId":50772,"journal":{"name":"Annals of Applied Statistics","volume":"12 4","pages":"2457-2482"},"PeriodicalIF":1.8,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6322847/pdf/nihms-997321.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"36849823","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
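The dynamic program behind the ℓ0 proposal can be illustrated with the classic optimal-partitioning recursion F(t) = min over s < t of { F(s) + C(y[s+1..t]) + lambda }, solved exactly in O(T^2). The sketch below uses a constant-mean segment cost for simplicity, whereas the paper's cost encodes exponentially decaying calcium dynamics between spikes; the toy signal and penalty are invented for the example.

```python
import numpy as np

def l0_segment(y, lam):
    """Exact optimal partitioning under an l0 changepoint penalty.

    F[t] = best cost of y[:t]; a segment's cost is its within-segment
    sum of squares around the segment mean, plus lam per changepoint.
    """
    T = len(y)
    cum = np.concatenate([[0.0], np.cumsum(y)])
    cum2 = np.concatenate([[0.0], np.cumsum(np.square(y))])

    def seg_cost(s, t):  # residual sum of squares of y[s:t] around its mean
        n = t - s
        return (cum2[t] - cum2[s]) - (cum[t] - cum[s]) ** 2 / n

    F = np.full(T + 1, np.inf)
    F[0] = -lam  # so the first segment is not charged a changepoint
    prev = np.zeros(T + 1, dtype=int)
    for t in range(1, T + 1):
        costs = [F[s] + seg_cost(s, t) + lam for s in range(t)]
        s_best = int(np.argmin(costs))
        F[t], prev[t] = costs[s_best], s_best

    # Backtrack the optimal changepoints.
    cps, t = [], T
    while t > 0:
        t = prev[t]
        if t > 0:
            cps.append(int(t))
    return sorted(cps)

y = np.concatenate([np.zeros(50), np.full(50, 3.0)])
changepoints = l0_segment(y, lam=1.0)  # recovers the single jump at index 50
```

The key point the abstract makes is that, despite the nonconvex ℓ0 penalty, this recursion reaches the global optimum; the paper's functional pruning makes the analogous computation fast enough for traces of 100,000 timesteps.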
{"title":"Modeling Hybrid Traits for Comorbidity and Genetic Studies of Alcohol and Nicotine Co-Dependence.","authors":"Heping Zhang, Dungang Liu, Jiwei Zhao, Xuan Bi","doi":"10.1214/18-AOAS1156","DOIUrl":"10.1214/18-AOAS1156","url":null,"abstract":"<p><p>We propose a novel multivariate model for analyzing hybrid traits and identifying genetic factors for comorbid conditions. Comorbidity is a common phenomenon in mental health in which an individual suffers from multiple disorders simultaneously. For example, in the Study of Addiction: Genetics and Environment (SAGE), alcohol and nicotine addiction were recorded through multiple assessments that we refer to as hybrid traits. Statistical inference for studying the genetic basis of hybrid traits has not been well-developed. Recent rank-based methods have been utilized for conducting association analyses of hybrid traits but do not inform the strength or direction of effects. To overcome this limitation, a parametric modeling framework is imperative. Although such parametric frameworks have been proposed in theory, they are neither well-developed nor extensively used in practice due to their reliance on complicated likelihood functions that have high computational complexity. Many existing parametric frameworks tend to instead use pseudo-likelihoods to reduce computational burdens. Here, we develop a model fitting algorithm for the full likelihood. Our extensive simulation studies demonstrate that inference based on the full likelihood controls the type-I error rate and, compared with several existing methods for hybrid models, gains power and improves effect size estimation. These advantages remain even if the distribution of the latent variables is misspecified. After analyzing the SAGE data, we identify three genetic variants (rs7672861, rs958331, rs879330) that are significantly associated with the comorbidity of alcohol and nicotine addiction at the chromosome-wide level. 
Moreover, our approach has greater power in this analysis than several existing methods for hybrid traits. Although the analysis of the SAGE data motivated us to develop the model, it can be broadly applied to analyze any hybrid responses.</p>","PeriodicalId":50772,"journal":{"name":"Annals of Applied Statistics","volume":"12 4","pages":"2359-2378"},"PeriodicalIF":1.8,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6338437/pdf/nihms-997314.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"36883672","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A SIMULATION-BASED FRAMEWORK FOR ASSESSING THE FEASIBILITY OF RESPONDENT-DRIVEN SAMPLING FOR ESTIMATING CHARACTERISTICS IN POPULATIONS OF LESBIAN, GAY AND BISEXUAL OLDER ADULTS.","authors":"Maryclare Griffin, Krista J Gile, Karen I Fredricksen-Goldsen, Mark S Handcock, Elena A Erosheva","doi":"10.1214/18-AOAS1151","DOIUrl":"10.1214/18-AOAS1151","url":null,"abstract":"<p><p>Respondent-driven sampling (RDS) is a method for sampling from a target population by leveraging social connections. RDS is invaluable to the study of hard-to-reach populations. However, RDS is costly and can be infeasible. RDS is infeasible when RDS point estimators have small effective sample sizes (large design effects) or when RDS interval estimators have large confidence intervals relative to estimates obtained in previous studies or poor coverage. As a result, researchers need tools to assess in advance whether or not estimation of certain characteristics of interest for specific populations is feasible. In this paper, we develop a simulation-based framework for using pilot data-in the form of a convenience sample of aggregated, egocentric data and estimates of subpopulation sizes within the target population-to assess whether or not RDS is feasible for estimating characteristics of a target population. In doing so, we assume that more is known about egos than alters in the pilot data, which is often the case with aggregated, egocentric data in practice. We build on existing methods for estimating the structure of social networks from aggregated, egocentric sample data and estimates of subpopulation sizes within the target population. 
We apply this framework to assess the feasibility of estimating the proportion male, proportion bisexual, proportion depressed and proportion infected with HIV/AIDS within three spatially distinct target populations of older lesbian, gay and bisexual adults using pilot data from the Caring and Aging with Pride Study and the Gallup Daily Tracking Survey. We conclude that using an RDS sample of 300 subjects is infeasible for estimating the proportion male, but feasible for estimating the proportion bisexual, proportion depressed and proportion infected with HIV/AIDS in all three target populations.</p>","PeriodicalId":50772,"journal":{"name":"Annals of Applied Statistics","volume":"12 4","pages":"2252-2278"},"PeriodicalIF":1.8,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6800244/pdf/nihms-1052724.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41219381","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SCALPEL: EXTRACTING NEURONS FROM CALCIUM IMAGING DATA.","authors":"Ashley Petersen, Noah Simon, Daniela Witten","doi":"10.1214/18-AOAS1159","DOIUrl":"10.1214/18-AOAS1159","url":null,"abstract":"In the past few years, new technologies in the field of neuroscience have made it possible to simultaneously image activity in large populations of neurons at cellular resolution in behaving animals. In mid-2016, a huge repository of this so-called \"calcium imaging\" data was made publicly available. The availability of this large-scale data resource opens the door to a host of scientific questions for which new statistical methods must be developed. In this paper we consider the first step in the analysis of calcium imaging data-namely, identifying the neurons in a calcium imaging video. We propose a dictionary learning approach for this task. First, we perform image segmentation to develop a dictionary containing a huge number of candidate neurons. Next, we refine the dictionary using clustering. Finally, we apply the dictionary to select neurons and estimate their corresponding activity over time, using a sparse group lasso optimization problem. We assess performance on simulated calcium imaging data and apply our proposal to three calcium imaging data sets. 
Our proposed approach is implemented in the R package scalpel, which is available on CRAN.","PeriodicalId":50772,"journal":{"name":"Annals of Applied Statistics","volume":"12 4","pages":"2430-2456"},"PeriodicalIF":1.8,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1214/18-AOAS1159","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"36746524","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The effects of nonignorable missing data on label-free mass spectrometry proteomics experiments.","authors":"Jonathon J O'Brien, Harsha P Gunawardena, Joao A Paulo, Xian Chen, Joseph G Ibrahim, Steven P Gygi, Bahjat F Qaqish","doi":"10.1214/18-AOAS1144","DOIUrl":"10.1214/18-AOAS1144","url":null,"abstract":"<p><p>An idealized version of a label-free discovery mass spectrometry proteomics experiment would provide absolute abundance measurements for a whole proteome, across varying conditions. Unfortunately, this ideal is not realized. Measurements are made on peptides requiring an inferential step to obtain protein level estimates. The inference is complicated by experimental factors that necessitate relative abundance estimation and result in widespread non-ignorable missing data. Relative abundance on the log scale takes the form of parameter contrasts. In a complete-case analysis, contrast estimates may be biased by missing data and a substantial amount of useful information will often go unused. To avoid problems with missing data, many analysts have turned to single imputation solutions. Unfortunately, these methods often create further difficulties by hiding inestimable contrasts, preventing the recovery of interblock information and failing to account for imputation uncertainty. To mitigate many of the problems caused by missing values, we propose the use of a Bayesian selection model. Our model is tested on simulated data, real data with simulated missing values, and on a ground truth dilution experiment where all of the true relative changes are known. 
The analysis suggests that our model, compared with various imputation strategies and complete-case analyses, can increase accuracy and provide substantial improvements to interval coverage.</p>","PeriodicalId":50772,"journal":{"name":"Annals of Applied Statistics","volume":"12 4","pages":"2075-2095"},"PeriodicalIF":1.8,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1214/18-AOAS1144","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"36763424","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}