{"title":"MULTILAYER KNOCKOFF FILTER: CONTROLLED VARIABLE SELECTION AT MULTIPLE RESOLUTIONS.","authors":"Eugene Katsevich, Chiara Sabatti","doi":"10.1214/18-AOAS1185","DOIUrl":"https://doi.org/10.1214/18-AOAS1185","url":null,"abstract":"<p><p>We tackle the problem of selecting from among a large number of variables those that are \"important\" for an outcome. We consider situations where groups of variables are also of interest. For example, each variable might be a genetic polymorphism, and we might want to study how a trait depends on variability in genes, segments of DNA that typically contain multiple such polymorphisms. In this context, to discover that a variable is relevant for the outcome implies discovering that the larger entity it represents is also important. To guarantee meaningful results with high chance of replicability, we suggest controlling the rate of false discoveries for findings at the level of individual variables and at the level of groups. Building on the knockoff construction of Barber and Candès [<i>Ann. Statist.</i> <b>43</b> (2015) 2055-2085] and the multilayer testing framework of Barber and Ramdas [<i>J. Roy. Statist. Soc. Ser. B</i> <b>79</b> (2017) 1247-1268], we introduce the multilayer knockoff filter (MKF). We prove that MKF simultaneously controls the FDR at each resolution and use simulations to show that it incurs little power loss compared to methods that provide guarantees only for the discoveries of individual variables. 
We apply MKF to analyze a genetic dataset and find that it successfully reduces the number of false gene discoveries without a significant reduction in power.</p>","PeriodicalId":50772,"journal":{"name":"Annals of Applied Statistics","volume":"13 1","pages":"1-33"},"PeriodicalIF":1.8,"publicationDate":"2019-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1214/18-AOAS1185","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41219385","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"BAYESIAN ANALYSIS OF INFANT'S GROWTH DYNAMICS WITH <i>IN UTERO</i> EXPOSURE TO ENVIRONMENTAL TOXICANTS.","authors":"Jonggyu Baek, Bin Zhu, Peter X K Song","doi":"10.1214/18-aoas1199","DOIUrl":"10.1214/18-aoas1199","url":null,"abstract":"<p><p>Early infancy, from birth to 3 years, is critical for cognitive, emotional and social development of infants. During this period, infants' developmental tempo and outcomes are potentially impacted by <i>in utero</i> exposure to endocrine disrupting compounds (EDCs), such as bisphenol A (BPA) and phthalates. We investigate effects of ten ubiquitous EDCs on the infant growth dynamics of body mass index (BMI) in a birth cohort study. Modeling growth acceleration is proposed to understand the \"force of growth\" through a class of semiparametric stochastic velocity models. The great flexibility of such a dynamic model enables us to capture subject-specific dynamics of growth trajectories and to assess effects of the EDCs on potential delay of growth. We adopted a Bayesian method with the Ornstein-Uhlenbeck process as the prior for the growth rate function, in which the World Health Organization global infant's growth curves were integrated into our analysis. We found that exposure to BPA and most phthalates during the first trimester of pregnancy was inversely associated with BMI growth acceleration, resulting in a delayed achievement of infant BMI peak. 
Such early growth deficiency has been reported to have a profound impact on health outcomes in puberty (e.g., timing of sexual maturation) and adulthood.</p>","PeriodicalId":50772,"journal":{"name":"Annals of Applied Statistics","volume":"13 1","pages":"297-320"},"PeriodicalIF":1.8,"publicationDate":"2019-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10617987/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"71428742","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"EXACT SPIKE TRAIN INFERENCE VIA ℓ<sub>0</sub> OPTIMIZATION.","authors":"Sean Jewell, Daniela Witten","doi":"10.1214/18-AOAS1162","DOIUrl":"10.1214/18-AOAS1162","url":null,"abstract":"<p><p>In recent years new technologies in neuroscience have made it possible to measure the activities of large numbers of neurons simultaneously in behaving animals. For each neuron a <i>fluorescence trace</i> is measured; this can be seen as a first-order approximation of the neuron's activity over time. Determining the exact time at which a neuron spikes on the basis of its fluorescence trace is an important open problem in the field of computational neuroscience. Recently, a convex optimization problem involving an ℓ<sub>1</sub> penalty was proposed for this task. In this paper we slightly modify that recent proposal by replacing the ℓ<sub>1</sub> penalty with an ℓ<sub>0</sub> penalty. In stark contrast to the conventional wisdom that ℓ<sub>0</sub> optimization problems are computationally intractable, we show that the resulting optimization problem can be efficiently solved for the global optimum using an extremely simple and efficient dynamic programming algorithm. Our R-language implementation of the proposed algorithm runs in a few minutes on fluorescence traces of 100,000 timesteps. Furthermore, our proposal leads to substantial improvements over the previous ℓ<sub>1</sub> proposal, in simulations as well as on two calcium imaging datasets. R-language software for our proposal is available on CRAN in the package LZeroSpikeInference. 
Instructions for running this software in python can be found at https://github.com/jewellsean/LZeroSpikeInference.</p>","PeriodicalId":50772,"journal":{"name":"Annals of Applied Statistics","volume":"12 4","pages":"2457-2482"},"PeriodicalIF":1.8,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6322847/pdf/nihms-997321.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"36849823","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Modeling Hybrid Traits for Comorbidity and Genetic Studies of Alcohol and Nicotine Co-Dependence.","authors":"Heping Zhang, Dungang Liu, Jiwei Zhao, Xuan Bi","doi":"10.1214/18-AOAS1156","DOIUrl":"10.1214/18-AOAS1156","url":null,"abstract":"<p><p>We propose a novel multivariate model for analyzing hybrid traits and identifying genetic factors for comorbid conditions. Comorbidity is a common phenomenon in mental health in which an individual suffers from multiple disorders simultaneously. For example, in the Study of Addiction: Genetics and Environment (SAGE), alcohol and nicotine addiction were recorded through multiple assessments that we refer to as hybrid traits. Statistical inference for studying the genetic basis of hybrid traits has not been well-developed. Recent rank-based methods have been utilized for conducting association analyses of hybrid traits but do not inform the strength or direction of effects. To overcome this limitation, a parametric modeling framework is imperative. Although such parametric frameworks have been proposed in theory, they are neither well-developed nor extensively used in practice due to their reliance on complicated likelihood functions that have high computational complexity. Many existing parametric frameworks tend to instead use pseudo-likelihoods to reduce computational burdens. Here, we develop a model fitting algorithm for the full likelihood. Our extensive simulation studies demonstrate that inference based on the full likelihood can control the type-I error rate, and gains power and improves the effect size estimation when compared with several existing methods for hybrid models. These advantages remain even if the distribution of the latent variables is misspecified. After analyzing the SAGE data, we identify three genetic variants (rs7672861, rs958331, rs879330) that are significantly associated with the comorbidity of alcohol and nicotine addiction at the chromosome-wide level. 
Moreover, our approach has greater power in this analysis than several existing methods for hybrid traits. Although the analysis of the SAGE data motivated us to develop the model, it can be broadly applied to analyze any hybrid responses.</p>","PeriodicalId":50772,"journal":{"name":"Annals of Applied Statistics","volume":"12 4","pages":"2359-2378"},"PeriodicalIF":1.8,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6338437/pdf/nihms-997314.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"36883672","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A SIMULATION-BASED FRAMEWORK FOR ASSESSING THE FEASIBILITY OF RESPONDENT-DRIVEN SAMPLING FOR ESTIMATING CHARACTERISTICS IN POPULATIONS OF LESBIAN, GAY AND BISEXUAL OLDER ADULTS.","authors":"Maryclare Griffin, Krista J Gile, Karen I Fredricksen-Goldsen, Mark S Handcock, Elena A Erosheva","doi":"10.1214/18-AOAS1151","DOIUrl":"10.1214/18-AOAS1151","url":null,"abstract":"<p><p>Respondent-driven sampling (RDS) is a method for sampling from a target population by leveraging social connections. RDS is invaluable to the study of hard-to-reach populations. However, RDS is costly and can be infeasible. RDS is infeasible when RDS point estimators have small effective sample sizes (large design effects) or when RDS interval estimators have large confidence intervals relative to estimates obtained in previous studies or poor coverage. As a result, researchers need tools to assess in advance whether or not estimation of certain characteristics of interest for specific populations is feasible. In this paper, we develop a simulation-based framework for using pilot data (in the form of a convenience sample of aggregated, egocentric data and estimates of subpopulation sizes within the target population) to assess whether or not RDS is feasible for estimating characteristics of a target population. In doing so, we assume that more is known about egos than alters in the pilot data, which is often the case with aggregated, egocentric data in practice. We build on existing methods for estimating the structure of social networks from aggregated, egocentric sample data and estimates of subpopulation sizes within the target population. 
We apply this framework to assess the feasibility of estimating the proportion male, proportion bisexual, proportion depressed and proportion infected with HIV/AIDS within three spatially distinct target populations of older lesbian, gay and bisexual adults using pilot data from the Caring and Aging with Pride Study and the Gallup Daily Tracking Survey. We conclude that using an RDS sample of 300 subjects is infeasible for estimating the proportion male, but feasible for estimating the proportion bisexual, proportion depressed and proportion infected with HIV/AIDS in all three target populations.</p>","PeriodicalId":50772,"journal":{"name":"Annals of Applied Statistics","volume":"12 4","pages":"2252-2278"},"PeriodicalIF":1.8,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6800244/pdf/nihms-1052724.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41219381","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SCALPEL: EXTRACTING NEURONS FROM CALCIUM IMAGING DATA.","authors":"Ashley Petersen, Noah Simon, Daniela Witten","doi":"10.1214/18-AOAS1159","DOIUrl":"10.1214/18-AOAS1159","url":null,"abstract":"In the past few years, new technologies in the field of neuroscience have made it possible to simultaneously image activity in large populations of neurons at cellular resolution in behaving animals. In mid-2016, a huge repository of this so-called \"calcium imaging\" data was made publicly available. The availability of this large-scale data resource opens the door to a host of scientific questions for which new statistical methods must be developed. In this paper we consider the first step in the analysis of calcium imaging data, namely, identifying the neurons in a calcium imaging video. We propose a dictionary learning approach for this task. First, we perform image segmentation to develop a dictionary containing a huge number of candidate neurons. Next, we refine the dictionary using clustering. Finally, we apply the dictionary to select neurons and estimate their corresponding activity over time, using a sparse group lasso optimization problem. We assess performance on simulated calcium imaging data and apply our proposal to three calcium imaging data sets. 
Our proposed approach is implemented in the R package scalpel, which is available on CRAN.","PeriodicalId":50772,"journal":{"name":"Annals of Applied Statistics","volume":"12 4","pages":"2430-2456"},"PeriodicalIF":1.8,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1214/18-AOAS1159","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"36746524","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The effects of nonignorable missing data on label-free mass spectrometry proteomics experiments.","authors":"Jonathon J O'Brien, Harsha P Gunawardena, Joao A Paulo, Xian Chen, Joseph G Ibrahim, Steven P Gygi, Bahjat F Qaqish","doi":"10.1214/18-AOAS1144","DOIUrl":"10.1214/18-AOAS1144","url":null,"abstract":"<p><p>An idealized version of a label-free discovery mass spectrometry proteomics experiment would provide absolute abundance measurements for a whole proteome, across varying conditions. Unfortunately, this ideal is not realized. Measurements are made on peptides requiring an inferential step to obtain protein level estimates. The inference is complicated by experimental factors that necessitate relative abundance estimation and result in widespread non-ignorable missing data. Relative abundance on the log scale takes the form of parameter contrasts. In a complete-case analysis, contrast estimates may be biased by missing data and a substantial amount of useful information will often go unused. To avoid problems with missing data, many analysts have turned to single imputation solutions. Unfortunately, these methods often create further difficulties by hiding inestimable contrasts, preventing the recovery of interblock information and failing to account for imputation uncertainty. To mitigate many of the problems caused by missing values, we propose the use of a Bayesian selection model. Our model is tested on simulated data, real data with simulated missing values, and on a ground truth dilution experiment where all of the true relative changes are known. 
The analysis suggests that our model, compared with various imputation strategies and complete-case analyses, can increase accuracy and provide substantial improvements to interval coverage.</p>","PeriodicalId":50772,"journal":{"name":"Annals of Applied Statistics","volume":"12 4","pages":"2075-2095"},"PeriodicalIF":1.8,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1214/18-AOAS1144","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"36763424","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"REFINING CELLULAR PATHWAY MODELS USING AN ENSEMBLE OF HETEROGENEOUS DATA SOURCES.","authors":"Alexander M Franks, Florian Markowetz, Edoardo M Airoldi","doi":"10.1214/16-aoas915","DOIUrl":"10.1214/16-aoas915","url":null,"abstract":"<p><p>Improving current models and hypotheses of cellular pathways is one of the major challenges of systems biology and functional genomics. There is a need for methods to build on established expert knowledge and reconcile it with results of new high-throughput studies. Moreover, the available sources of data are heterogeneous, and the data need to be integrated in different ways depending on which part of the pathway they are most informative for. In this paper, we introduce a compartment specific strategy to integrate edge, node and path data for refining a given network hypothesis. To carry out inference, we use a local-move Gibbs sampler for updating the pathway hypothesis from a compendium of heterogeneous data sources, and a new network regression idea for integrating protein attributes. We demonstrate the utility of this approach in a case study of the pheromone response MAPK pathway in the yeast <i>S. cerevisiae</i>.</p>","PeriodicalId":50772,"journal":{"name":"Annals of Applied Statistics","volume":"12 3","pages":"1361-1384"},"PeriodicalIF":1.8,"publicationDate":"2018-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9733905/pdf/nihms-1823482.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10366316","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A TESTING BASED APPROACH TO THE DISCOVERY OF DIFFERENTIALLY CORRELATED VARIABLE SETS.","authors":"Kelly Bodwin, Kai Zhang, Andrew Nobel","doi":"10.1214/17-AOAS1083","DOIUrl":"10.1214/17-AOAS1083","url":null,"abstract":"<p><p>Given data obtained under two sampling conditions, it is often of interest to identify variables that behave differently in one condition than in the other. We introduce a method for differential analysis of second-order behavior called Differential Correlation Mining (DCM). The DCM method identifies differentially correlated sets of variables, with the property that the average pairwise correlation between variables in a set is higher under one sample condition than the other. DCM is based on an iterative search procedure that adaptively updates the size and elements of a candidate variable set. Updates are performed via hypothesis testing of individual variables, based on the asymptotic distribution of their average differential correlation. We investigate the performance of DCM by applying it to simulated data as well as to recent experimental datasets in genomics and brain imaging.</p>","PeriodicalId":50772,"journal":{"name":"Annals of Applied Statistics","volume":"12 2","pages":"1180-1203"},"PeriodicalIF":1.8,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1214/17-AOAS1083","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"37486780","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ADJUSTED REGULARIZATION IN LATENT GRAPHICAL MODELS: APPLICATION TO MULTIPLE-NEURON SPIKE COUNT DATA.","authors":"Giuseppe Vinci, Valérie Ventura, Matthew A Smith, Robert E Kass","doi":"10.1214/18-AOAS1190","DOIUrl":"10.1214/18-AOAS1190","url":null,"abstract":"<p><p>A major challenge in contemporary neuroscience is to analyze data from large numbers of neurons recorded simultaneously across many experimental replications (trials), where the data are counts of neural firing events, and one of the basic problems is to characterize the dependence structure among such multivariate counts. Methods of estimating high-dimensional covariation based on <i>ℓ</i> <sub>1</sub>-regularization are most appropriate when there are a small number of relatively large partial correlations, but in neural data there are often large numbers of relatively small partial correlations. Furthermore, the variation across trials is often confounded by Poisson-like variation within trials. To overcome these problems we introduce a comprehensive methodology that imbeds a Gaussian graphical model into a hierarchical structure: the counts are assumed Poisson, conditionally on latent variables that follow a Gaussian graphical model, and the graphical model parameters, in turn, are assumed to depend on physiologically-motivated covariates, which can greatly improve correct detection of interactions (non-zero partial correlations). We develop a Bayesian approach to fitting this covariate-adjusted generalized graphical model and we demonstrate its success in simulation studies. 
We then apply it to data from an experiment on visual attention, where we assess functional interactions between neurons recorded from two brain areas.</p>","PeriodicalId":50772,"journal":{"name":"Annals of Applied Statistics","volume":"12 2","pages":"1068-1095"},"PeriodicalIF":1.3,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6879176/pdf/nihms-1014977.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49684619","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}