Yuliang Li, Yang Ni, Leah H Rubin, Amanda B Spence, Yanxun Xu
{"title":"BAGEL: A BAYESIAN GRAPHICAL MODEL FOR INFERRING DRUG EFFECT LONGITUDINALLY ON DEPRESSION IN PEOPLE WITH HIV.","authors":"Yuliang Li, Yang Ni, Leah H Rubin, Amanda B Spence, Yanxun Xu","doi":"10.1214/21-AOAS1492","DOIUrl":"https://doi.org/10.1214/21-AOAS1492","url":null,"abstract":"<p><p>Access and adherence to antiretroviral therapy (ART) have transformed the face of HIV infection from a fatal to a chronic disease. However, ART is also known for its side effects. Studies have reported that ART is associated with depressive symptomatology. Large-scale HIV clinical databases with individuals' longitudinal depression records, ART medications, and clinical characteristics offer researchers unprecedented opportunities to study the effects of ART drugs on depression over time. We develop BAGEL, a Bayesian graphical model to investigate longitudinal effects of ART drugs on a range of depressive symptoms while adjusting for participants' demographic, behavioral, and clinical characteristics, and taking into account the heterogeneous population through a Bayesian nonparametric prior. We evaluate BAGEL through simulation studies. Application to a dataset from the Women's Interagency HIV Study yields interpretable and clinically useful results. 
BAGEL not only can improve our understanding of ART drugs' effects on disparate depression symptoms, but also has clinical utility in guiding informed and effective treatment selection to facilitate precision medicine in HIV.</p>","PeriodicalId":50772,"journal":{"name":"Annals of Applied Statistics","volume":"16 1","pages":"21-39"},"PeriodicalIF":1.8,"publicationDate":"2022-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9236217/pdf/nihms-1778597.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10737070","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Iain Carmichael, Benjamin C Calhoun, Katherine A Hoadley, Melissa A Troester, Joseph Geradts, Heather D Couture, Linnea Olsson, Charles M Perou, Marc Niethammer, Jan Hannig, J S Marron
{"title":"JOINT AND INDIVIDUAL ANALYSIS OF BREAST CANCER HISTOLOGIC IMAGES AND GENOMIC COVARIATES.","authors":"Iain Carmichael, Benjamin C Calhoun, Katherine A Hoadley, Melissa A Troester, Joseph Geradts, Heather D Couture, Linnea Olsson, Charles M Perou, Marc Niethammer, Jan Hannig, J S Marron","doi":"10.1214/20-aoas1433","DOIUrl":"10.1214/20-aoas1433","url":null,"abstract":"<p><p>The two main approaches in the study of breast cancer are histopathology (analyzing visual characteristics of tumors) and genomics. While both histopathology and genomics are fundamental to cancer research, the connections between these fields have been relatively superficial. We bridge this gap by investigating the Carolina Breast Cancer Study through the development of an integrative, exploratory analysis framework. Our analysis gives insights - some known, some novel - that are engaging to both pathologists and geneticists. Our analysis framework is based on Angle-based Joint and Individual Variation Explained (AJIVE) for statistical data integration and exploits Convolutional Neural Networks (CNNs) as a powerful, automatic method for image feature extraction. CNNs raise interpretability issues that we address by developing novel methods to explore visual modes of variation captured by statistical algorithms (e.g. 
PCA or AJIVE) applied to CNN features.</p>","PeriodicalId":50772,"journal":{"name":"Annals of Applied Statistics","volume":"15 4","pages":"1697-1722"},"PeriodicalIF":1.3,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9007558/pdf/nihms-1780328.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10147676","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Brady T West, Roderick J Little, Rebecca R Andridge, Philip S Boonstra, Erin B Ware, Anita Pandit, Fernanda Alvarado-Leiton
{"title":"ASSESSING SELECTION BIAS IN REGRESSION COEFFICIENTS ESTIMATED FROM NONPROBABILITY SAMPLES WITH APPLICATIONS TO GENETICS AND DEMOGRAPHIC SURVEYS.","authors":"Brady T West, Roderick J Little, Rebecca R Andridge, Philip S Boonstra, Erin B Ware, Anita Pandit, Fernanda Alvarado-Leiton","doi":"10.1214/21-aoas1453","DOIUrl":"https://doi.org/10.1214/21-aoas1453","url":null,"abstract":"<p><p>Selection bias is a serious potential problem for inference about relationships of scientific interest based on samples without well-defined probability sampling mechanisms. Motivated by the potential for selection bias in: (a) estimated relationships of polygenic scores (PGSs) with phenotypes in genetic studies of volunteers and (b) estimated differences in subgroup means in surveys of smartphone users, we derive novel measures of selection bias for estimates of the coefficients in linear and probit regression models fitted to nonprobability samples, when aggregate-level auxiliary data are available for the selected sample and the target population. The measures arise from normal pattern-mixture models that allow analysts to examine the sensitivity of their inferences to assumptions about nonignorable selection in these samples. We examine the effectiveness of the proposed measures in a simulation study and then use them to quantify the selection bias in: (a) estimated PGS-phenotype relationships in a large study of volunteers recruited via Facebook and (b) estimated subgroup differences in mean past-year employment duration in a nonprobability sample of low-educated smartphone users. 
We evaluate the performance of the measures in these applications using benchmark estimates from large probability samples.</p>","PeriodicalId":50772,"journal":{"name":"Annals of Applied Statistics","volume":"15 3","pages":"1556-1581"},"PeriodicalIF":1.8,"publicationDate":"2021-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8887878/pdf/nihms-1773953.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10686307","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Xiuyu Ma, Keegan Korthauer, Christina Kendziorski, Michael A Newton
{"title":"A COMPOSITIONAL MODEL TO ASSESS EXPRESSION CHANGES FROM SINGLE-CELL RNA-SEQ DATA.","authors":"Xiuyu Ma, Keegan Korthauer, Christina Kendziorski, Michael A Newton","doi":"10.1214/20-aoas1423","DOIUrl":"https://doi.org/10.1214/20-aoas1423","url":null,"abstract":"<p><p>On the problem of scoring genes for evidence of changes in the distribution of single-cell expression, we introduce an empirical Bayesian mixture approach and evaluate its operating characteristics in a range of numerical experiments. The proposed approach leverages cell-subtype structure revealed in cluster analysis in order to boost gene-level information on expression changes. Cell clustering informs gene-level analysis through a specially-constructed prior distribution over pairs of multinomial probability vectors; this prior meshes with available model-based tools that score patterns of differential expression over multiple subtypes. We derive an explicit formula for the posterior probability that a gene has the same distribution in two cellular conditions, allowing for a gene-specific mixture over subtypes in each condition. Advantage is gained by the compositional structure of the model not only in which a host of gene-specific mixture components are allowed but also in which the mixing proportions are constrained at the whole cell level. This structure leads to a novel form of information sharing through which the cell-clustering results support gene-level scoring of differential distribution. 
The result, according to our numerical experiments, is improved sensitivity compared to several standard approaches for detecting distributional expression changes.</p>","PeriodicalId":50772,"journal":{"name":"Annals of Applied Statistics","volume":"15 2","pages":"880-901"},"PeriodicalIF":1.8,"publicationDate":"2021-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10275512/pdf/nihms-1901161.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"9762402","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MODEL-BASED FEATURE SELECTION AND CLUSTERING OF RNA-SEQ DATA FOR UNSUPERVISED SUBTYPE DISCOVERY.","authors":"David K Lim, Naim U Rashid, Joseph G Ibrahim","doi":"10.1214/20-aoas1407","DOIUrl":"10.1214/20-aoas1407","url":null,"abstract":"<p><p>Clustering is a form of unsupervised learning that aims to uncover latent groups within data based on similarity across a set of features. A common application of this in biomedical research is in delineating novel cancer subtypes from patient gene expression data, given a set of informative genes. However, it is typically unknown <i>a priori</i> which genes may be informative in discriminating between clusters, and what the optimal number of clusters is. Few methods exist for performing unsupervised clustering of RNA-seq samples, and none currently adjust for between-sample global normalization factors, select cluster-discriminatory genes, or account for potential confounding variables during clustering. To address these issues, we propose the Feature Selection and Clustering of RNA-seq (FSCseq): a model-based clustering algorithm that utilizes a finite mixture of regression (FMR) model and the quadratic penalty method with a Smoothly-Clipped Absolute Deviation (SCAD) penalty. The maximization is done by a penalized Classification EM algorithm, allowing us to include normalization factors and confounders in our modeling framework. Given the fitted model, our framework allows for subtype prediction in new patients via posterior probabilities of cluster membership, even in the presence of batch effects. 
Based on simulations and real data analysis, we show the advantages of our method relative to competing approaches.</p>","PeriodicalId":50772,"journal":{"name":"Annals of Applied Statistics","volume":"15 1","pages":"481-508"},"PeriodicalIF":1.8,"publicationDate":"2021-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8386505/pdf/nihms-1716637.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"9546884","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Brian J Reich, Yawen Guan, Denis Fourches, Joshua L Warren, Stefanie E Sarnat, Howard H Chang
{"title":"INTEGRATIVE STATISTICAL METHODS FOR EXPOSURE MIXTURES AND HEALTH.","authors":"Brian J Reich, Yawen Guan, Denis Fourches, Joshua L Warren, Stefanie E Sarnat, Howard H Chang","doi":"10.1214/20-AOAS1364","DOIUrl":"https://doi.org/10.1214/20-AOAS1364","url":null,"abstract":"<p><p>Humans are concurrently exposed to chemically, structurally and toxicologically diverse chemicals. A critical challenge for environmental epidemiology is to quantify the risk of adverse health outcomes resulting from exposures to such chemical mixtures and to identify which mixture constituents may be driving etiologic associations. A variety of statistical methods have been proposed to address these critical research questions. However, they generally rely solely on measured exposure and health data available within a specific study. Advancements in understanding of the role of mixtures on human health impacts may be better achieved through the utilization of external data and knowledge from multiple disciplines with innovative statistical tools. In this paper we develop new methods for health analyses that incorporate auxiliary information about the chemicals in a mixture, such as physicochemical, structural and/or toxicological data. We expect that the constituents identified using auxiliary information will be more biologically meaningful than those identified by methods that solely utilize observed correlations between measured exposure. We develop flexible Bayesian models by specifying prior distributions for the exposures and their effects that include auxiliary information and examine this idea over a spectrum of analyses from regression to factor analysis. The methods are applied to study the effects of volatile organic compounds on emergency room visits in Atlanta. 
We find that including cheminformatic information about the exposure variables improves prediction and provides a more interpretable model for emergency room visits for respiratory diseases.</p>","PeriodicalId":50772,"journal":{"name":"Annals of Applied Statistics","volume":"14 4","pages":"1945-1963"},"PeriodicalIF":1.8,"publicationDate":"2020-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8914338/pdf/nihms-1780774.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10265042","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"LOG-CONTRAST REGRESSION WITH FUNCTIONAL COMPOSITIONAL PREDICTORS: LINKING PRETERM INFANT'S GUT MICROBIOME TRAJECTORIES TO NEUROBEHAVIORAL OUTCOME.","authors":"Zhe Sun, Wanli Xu, Xiaomei Cong, Gen Li, Kun Chen","doi":"10.1214/20-aoas1357","DOIUrl":"https://doi.org/10.1214/20-aoas1357","url":null,"abstract":"<p><p>The neonatal intensive care unit (NICU) experience is known to be one of the most crucial factors that drive preterm infant's neurodevelopmental and health outcome. It is hypothesized that stressful early life experience of very preterm neonate is imprinting gut microbiome by the regulation of the so-called brain-gut axis, and consequently, certain microbiome markers are predictive of later infant neurodevelopment. To investigate, a preterm infant study was conducted; infant fecal samples were collected during the infants' first month of postnatal age, resulting in functional compositional microbiome data, and neurobehavioral outcomes were measured when infants reached 36-38 weeks of post-menstrual age. To identify potential microbiome markers and estimate how the trajectories of gut microbiome compositions during early postnatal stage impact later neurobehavioral outcomes of the preterm infants, we innovate a sparse log-contrast regression with functional compositional predictors. The functional simplex structure is strictly preserved, and the functional compositional predictors are allowed to have sparse, smoothly varying, and accumulating effects on the outcome through time. Through a pragmatic basis expansion step, the problem boils down to a linearly constrained sparse group regression, for which we develop an efficient algorithm and obtain theoretical performance guarantees. Our approach yields insightful results in the preterm infant study. 
The identified microbiome markers and the estimated time dynamics of their impact on the neurobehavioral outcome shed light on the linkage between stress accumulation in the early postnatal stage and the neurodevelopmental process of infants.</p>","PeriodicalId":50772,"journal":{"name":"Annals of Applied Statistics","volume":"14 3","pages":"1535-1556"},"PeriodicalIF":1.8,"publicationDate":"2020-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8218926/pdf/nihms-1601428.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"39100587","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Inferring a consensus problem list using penalized multistage models for ordered data.","authors":"Philip S Boonstra, John C Krauss","doi":"10.1214/20-aoas1361","DOIUrl":"10.1214/20-aoas1361","url":null,"abstract":"<p><p>A patient's medical problem list describes his or her current health status and aids in the coordination and transfer of care between providers. Because a problem list is generated once and then subsequently modified or updated, what is not usually observable is the provider-effect. That is, to what extent does a patient's problem in the electronic medical record actually reflect a consensus communication of that patient's current health status? To that end, we report on and analyze a unique interview-based design in which multiple medical providers independently generate problem lists for each of three patient case abstracts of varying clinical difficulty. Due to the uniqueness of both our data and the scientific objectives of our analysis, we apply and extend so-called multistage models for ordered lists and equip the models with variable selection penalties to induce sparsity. Each problem has a corresponding non-negative parameter estimate, interpreted as a relative log-odds ratio, with larger values suggesting greater importance and zero values suggesting unimportant problems. We use these fitted penalized models to quantify and report the extent of consensus. We conduct a simulation study to evaluate the performance of our methodology and then analyze the motivating problem list data. For the three case abstracts, the proportions of problems with model-estimated non-zero log-odds ratios were 10/28, 16/47, and 13/30. 
Physicians exhibited consensus on the highest ranked problems in the first and last case abstracts but agreement quickly deteriorated; in contrast, physicians broadly disagreed on the relevant problems for the middle - and most difficult - case abstract.</p>","PeriodicalId":50772,"journal":{"name":"Annals of Applied Statistics","volume":"14 3","pages":"1557-1580"},"PeriodicalIF":1.8,"publicationDate":"2020-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8345315/pdf/nihms-1696242.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"39291448","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"STATISTICAL METHODS FOR ANALYSIS OF COMBINED CATEGORICAL BIOMARKER DATA FROM MULTIPLE STUDIES.","authors":"Chao Cheng, Molin Wang","doi":"10.1214/20-aoas1337","DOIUrl":"10.1214/20-aoas1337","url":null,"abstract":"<p><p>In the analysis of pooled data from multiple studies involving a biomarker exposure, the biomarker measurements can vary across laboratories and usually require calibration to a reference assay prior to pooling. Previous research considers the measurements from a reference laboratory as the gold standard, even though measurements in the reference laboratory are not necessarily closer to the underlying truth in reality. In this paper we do not treat any laboratory measurements as the gold standard, and we develop two statistical methods, the exact calibration and cut-off calibration methods, for the analysis of aggregated categorical biomarker data. We compare the performance of both methods for estimating the biomarker-disease relationship under a random sample or controls-only calibration design. Our findings include: (1) the exact calibration method provides significantly less biased estimates and more accurate confidence intervals than the other method; (2) the cut-off calibration method could yield estimates with minimal bias and valid confidence intervals under small measurement errors and/or small exposure effects; (3) controls-only calibration design can result in additional bias, but the bias is minimal if the exposure effects and/or disease prevalences are small. 
Finally, we illustrate the methods in an application evaluating the relationship between circulating vitamin D levels and colorectal cancer risk in a pooling project.</p>","PeriodicalId":50772,"journal":{"name":"Annals of Applied Statistics","volume":"14 3","pages":"1146-1163"},"PeriodicalIF":1.8,"publicationDate":"2020-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7903924/pdf/nihms-1669923.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"25407136","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Natalie Klein, Josue Orellana, Scott L Brincat, Earl K Miller, Robert E Kass
{"title":"TORUS GRAPHS FOR MULTIVARIATE PHASE COUPLING ANALYSIS.","authors":"Natalie Klein, Josue Orellana, Scott L Brincat, Earl K Miller, Robert E Kass","doi":"10.1214/19-aoas1300","DOIUrl":"10.1214/19-aoas1300","url":null,"abstract":"<p><p>Angular measurements are often modeled as circular random variables, where there are natural circular analogues of moments, including correlation. Because a product of circles is a torus, a <i>d</i>-dimensional vector of circular random variables lies on a <i>d</i>-dimensional torus. For such vectors we present here a class of graphical models, which we call <i>torus graphs</i>, based on the full exponential family with pairwise interactions. The topological distinction between a torus and Euclidean space has several important consequences. Our development was motivated by the problem of identifying phase coupling among oscillatory signals recorded from multiple electrodes in the brain: oscillatory phases across electrodes might tend to advance or recede together, indicating coordination across brain areas. The data analyzed here consisted of 24 phase angles measured repeatedly across 840 experimental trials (replications) during a memory task, where the electrodes were in 4 distinct brain regions, all known to be active while memories are being stored or retrieved. In realistic numerical simulations, we found that a standard pairwise assessment, known as phase locking value, is unable to describe multivariate phase interactions, but that torus graphs can accurately identify conditional associations. Torus graphs generalize several more restrictive approaches that have appeared in various scientific literatures, and produced intuitive results in the data we analyzed. 
Torus graphs thus unify multivariate analysis of circular data and present fertile territory for future research.</p>","PeriodicalId":50772,"journal":{"name":"Annals of Applied Statistics","volume":"14 2","pages":"635-660"},"PeriodicalIF":1.3,"publicationDate":"2020-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9812283/pdf/nihms-1716022.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10503612","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}