{"title":"Canopy2: Tumor Phylogeny Inference by Bulk DNA and Single-Cell RNA Sequencing.","authors":"Ann Marie K Weideman, Rujin Wang, Joseph G Ibrahim, Yuchao Jiang","doi":"10.1007/s12561-024-09466-1","DOIUrl":"10.1007/s12561-024-09466-1","url":null,"abstract":"<p><p>Tumors are comprised of a mixture of distinct cell populations that differ in terms of genetic makeup and function. Such heterogeneity plays a role in the development of drug resistance and the ineffectiveness of targeted cancer therapies. Insight into this complexity can be obtained through the construction of a phylogenetic tree, which illustrates the evolutionary lineage of tumor cells as they acquire mutations over time. We propose Canopy2, a Bayesian framework that uses single nucleotide variants derived from bulk DNA and single-cell RNA sequencing to infer tumor phylogeny and conduct mutational profiling of tumor subpopulations. Canopy2 uses Markov chain Monte Carlo methods to sample from a joint probability distribution involving a mixture of binomial and beta-binomial distributions, specifically chosen to account for the sparsity and stochasticity of the single-cell data. Canopy2 demystifies the sources of zeros in the single-cell data and separates zeros categorized as non-cancerous (cells without mutations), stochastic (mutations not expressed due to bursting), and technical (expressed mutations not picked up by sequencing). Simulations demonstrate that Canopy2 consistently outperforms competing methods and reconstructs the clonal tree with high fidelity, even in situations involving low sequencing depth, poor single-cell yield, and highly-advanced and polyclonal tumors. We further assess the performance of Canopy2 through application to breast cancer and glioblastoma data, benchmarking against existing methods. 
Canopy2 is an open-source R package available at https://github.com/annweideman/canopy2.</p>","PeriodicalId":45094,"journal":{"name":"Statistics in Biosciences","volume":"18 1","pages":"68-110"},"PeriodicalIF":0.4,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12904911/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146203219","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multilevel Multivariate Functional Principal Component Analysis of Evoked and Induced Event-Related Spectral Perturbations.","authors":"Mingfei Dong, Donatello Telesca, Abigail Dickinson, Catherine Sugar, Sara J Webb, Shafali Jeste, April R Levin, Frederick Shic, Adam Naples, Susan Faja, Geraldine Dawson, James C McPartland, Damla Şentürk","doi":"10.1007/s12561-025-09510-8","DOIUrl":"10.1007/s12561-025-09510-8","url":null,"abstract":"<p><p>Event-related spectral perturbations (ERSPs) capture dynamic changes in electroencephalography (EEG) power across frequency and trial time. Even though they are obtained at the trial level, they are commonly averaged across trials and analyzed at the subject level for enhancing the signal-to-noise ratio. While evoked activity is stimulus-locked, representing the brain's predictable response to stimuli, induced signals that are not strictly locked to stimulus presentation are thought to be generated by higher-order processes, such as attention and integration. Motivated by joint modeling of multilevel (trials nested in subjects) and multivariate (evoked and induced) ERSP data from a visual-evoked potentials (VEP) task, we propose a multilevel multivariate functional principal components analysis (FPCA) for high-dimensional functional outcomes as a function of time and frequency. The proposed estimation procedure utilizes multilevel univariate FPCA decompositions along each variate of the multivariate outcome using fast covariance estimation and incorporates the dependency across outcome variates at each level of the data. Hence, the proposed approach for multilevel multivariate FPCA can efficiently scale up to higher dimensional functional outcomes and increasing number of variates in the multivariate functional outcome vector. Extensive simulations show the efficacy of the proposed approach, while applications to VEP data lead to new insights on autism-specific neural activity patterns. 
The autistic group shows significantly lower evoked and higher induced gamma power compared to the neurotypical group. In addition, while subject level variation is dominated by variation in the stimulus-locked evoked signal in neurotypical development, it is dominated by induced power in autism.</p>","PeriodicalId":45094,"journal":{"name":"Statistics in Biosciences","volume":" ","pages":""},"PeriodicalIF":0.4,"publicationDate":"2025-12-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12834560/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146067612","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Weighted Brier Score - an Overall Summary Measure for Risk Prediction Models with Clinical Utility Consideration.","authors":"Kehao Zhu, Yingye Zheng, Kwun Chuen Gary Chan","doi":"10.1007/s12561-025-09505-5","DOIUrl":"10.1007/s12561-025-09505-5","url":null,"abstract":"<p><p>As advancements in novel biomarker-based algorithms and models accelerate their use in disease risk prediction, it is crucial to evaluate these models within the context of their intended clinical application. Prediction models output the absolute risk of disease; subsequently, patient counseling and shared decision-making are based on the estimated individual risk and cost-benefit assessment. The overall impact of the application is referred to as clinical utility, which has recently received significant attention, along with a growing desire to incorporate it into model assessment. The classic Brier score is a popular measure of prediction accuracy; however, it is insufficient for effectively assessing clinical utility. To address this limitation, we propose a class of weighted Brier scores that aligns with the decision-theoretic framework of clinical utility. Additionally, we decompose the weighted Brier score into discrimination and calibration components, and we link the weighted Brier score to the <math><mi>H</mi></math> measure, which has been proposed as an alternative to the area under the receiver operating characteristic curve. This theoretical link to the <math><mi>H</mi></math> measure further supports our weighting method and underscores the essential elements of discrimination and calibration in risk prediction evaluation. 
The practical use of the weighted Brier score as an overall summary is demonstrated using data from a prostate cancer study.</p>","PeriodicalId":45094,"journal":{"name":"Statistics in Biosciences","volume":" ","pages":""},"PeriodicalIF":0.4,"publicationDate":"2025-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12523994/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145309467","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Accounting for Competing Risks in the Assessment of Prognostic Biomarkers' Discriminative Accuracy.","authors":"Xinran Huang, Xinyang Jiang, Ruosha Li, Jing Ning","doi":"10.1007/s12561-025-09499-0","DOIUrl":"https://doi.org/10.1007/s12561-025-09499-0","url":null,"abstract":"<p><p>The discriminative performance of biomarkers often changes over time and exhibits heterogeneity across subgroups defined by patient characteristics. Assessing how this performance varies with these factors is crucial for a comprehensive evaluation of biomarkers and to identify areas for improvement in sub-populations with poor performance. Additionally, the presence of competing risks complicates the assessment of discriminative performance. Ignoring competing risks can lead to misleading conclusions, as the biomarker's performance for the event of interest, such as disease onset, may be confounded by its performance for competing events, such as death. To address these challenges, we develop a regression model to assess the impact of covariates on the discriminative performance of biomarkers, characterized by the covariate-specific time-dependent area under the curve (AUC) for a specific cause. We construct a pseudo partial-likelihood for estimation and inference and establish the asymptotic properties of the proposed estimators. 
Through simulation studies, we demonstrate the finite sample performance of these estimators, and we apply the proposed method to data from the African American Study of Kidney Disease and Hypertension (AASK).</p>","PeriodicalId":45094,"journal":{"name":"Statistics in Biosciences","volume":" ","pages":""},"PeriodicalIF":0.4,"publicationDate":"2025-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12366773/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144973349","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Robust Privacy-Preserving Models for Cluster-Level Confounding: Recognizing Disparities in Access to Transplantation.","authors":"Nicholas Hartman, Kevin He","doi":"10.1007/s12561-025-09496-3","DOIUrl":"10.1007/s12561-025-09496-3","url":null,"abstract":"<p><p>In health services applications where the patients are clustered within common institutions or geographic regions, it is often of interest to estimate the treatment effects of the medical providers after adjusting for confounding risk factors that are related to patients' choices of provider but beyond the providers' control. While most existing risk-adjustment methods are only capable of controlling for patient-level confounding risk factors (e.g., age or comorbidities), there are often important cluster-level confounding variables (e.g., regional or community-level risk factors) that should be accounted for in provider evaluations. These adjustments for cluster-level confounding factors are further complicated by the limited availability of protected patient health data, the inevitable influence of unobservable confounding factors, and the presence of outlying cluster units. To address these issues, we propose a privacy-preserving model and a novel Pseudo-Bayesian inference method to robustly assess the providers' treatment effects with adjustments for observed cluster-level confounders and corrections for overdispersion from unobserved cluster-level confounding factors. We derive theoretical connections between our proposed estimation method and the Correlated Random Effects model, uncovering several advantages in terms of estimation stability, computational efficiency, and privacy preservation. 
Motivated by efforts to improve equity in transplant care, we apply these methods to evaluate transplant centers while adjusting for observed geographic disparities in donor organ availability and correcting for overdispersion from unobservable confounding factors, such as the complex impact of the COVID-19 pandemic.</p>","PeriodicalId":45094,"journal":{"name":"Statistics in Biosciences","volume":" ","pages":""},"PeriodicalIF":0.4,"publicationDate":"2025-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12830051/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146054195","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Central Posterior Envelopes for Bayesian Longitudinal Functional Principal Component Analysis.","authors":"Joanna Boland, Qi Qian, Donatello Telesca, Shafali Jeste, Abigail Dickinson, Damla Şentürk","doi":"10.1007/s12561-025-09497-2","DOIUrl":"10.1007/s12561-025-09497-2","url":null,"abstract":"<p><p>Longitudinally observed functional data are commonly encountered in biomedical studies. Under the weak separability assumption of the high dimensional covariance, the recently proposed Bayesian longitudinal functional principal component analysis (B-LFPCA) achieves the decomposition of the multidimensional signal into highly interpretable lower dimensional summaries, including eigenfunctions that capture directions of variation in the data along the longitudinal and functional dimensions. B-LFPCA provides uncertainty quantification of the estimated functional decomposition components through simultaneous parametric credible bands formed using the posterior sample. However, these traditional summaries are inherently based on point-wise summaries of the estimated functional components and do not take into account the functional nature of the estimated quantities. We introduce central posterior envelopes (CPEs) for uncertainty quantification of the low-dimensional B-LFPCA decomposition components based on functional depth ordering of the posterior estimates. The proposed CPEs are fully data-driven visualization tools, displaying the most-central regions of the posterior sample at specified <math><mi>α</mi></math> -level percentile contours. Modified band depth and modified volume depth are utilized to order posterior sample of functional decomposition components, including the mean function and the marginal longitudinal and functional eigenfunctions. 
The proposed CPEs are applied to analyze the longitudinally observed Event Related Potentials (ERPs) recorded during an implicit learning paradigm, leading to novel insights on longitudinal learning trends across a group of autistic kids and their neurotypical peers. Finally, effectiveness of the proposed CPEs is demonstrated through extensive simulations that explore different scenarios of increased variability in the longitudinal functional data.</p>","PeriodicalId":45094,"journal":{"name":"Statistics in Biosciences","volume":" ","pages":""},"PeriodicalIF":0.4,"publicationDate":"2025-07-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12716410/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145805508","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Bias and Efficiency Comparison between Multiple Imputation and Available-Case Analysis for Missing Data in Longitudinal Models.","authors":"Panpan Zhang, Sharon X Xie","doi":"10.1007/s12561-025-09493-6","DOIUrl":"10.1007/s12561-025-09493-6","url":null,"abstract":"<p><p>In this paper, we compare the performance of available-case analysis (ACA) and several multiple imputation (MI) approaches for handling missing data problems in longitudinal analysis through estimation bias and relative efficiency. When the missingness of covariates depends on observed responses, ACA produces estimation bias, but it is preferred when there are only missing values in longitudinal responses. Multilevel MI methods are not always a solution to longitudinal data analysis. Single-level MI methods, like fully conditional specification (FCS), provide unbiased estimates under a variety of missing data scenarios, and improve efficiency in certain scenarios. The missing data mechanism is generally assumed to be missing at random (MAR). We carry out a systematic synthetic data analysis where missing data exist in longitudinal outcomes and/or covariates under different kinds of missing data generation procedures. The analysis model is a linear mixed-effects model. For each of the missing data scenarios, we give our recommendation (between ACA and a specific MI method) based on theoretical justifications and extensive simulations. 
In addition, a longitudinal neurodegenerative disease dataset is used as a real case study.</p>","PeriodicalId":45094,"journal":{"name":"Statistics in Biosciences","volume":" ","pages":""},"PeriodicalIF":0.4,"publicationDate":"2025-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12356228/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144875909","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Covariate-Balancing-Aware Interpretable Deep Learning Models for Treatment Effect Estimation.","authors":"Kan Chen, Qishuo Yin, Qi Long","doi":"10.1007/s12561-023-09394-6","DOIUrl":"10.1007/s12561-023-09394-6","url":null,"abstract":"<p><p>Estimating treatment effects is of great importance for many biomedical applications with observational data. Particularly, interpretability of the treatment effects is preferable for many biomedical researchers. In this paper, we first provide a theoretical analysis and derive an upper bound for the bias of average treatment effect (ATE) estimation under the strong ignorability assumption. Derived by leveraging appealing properties of the weighted energy distance, our upper bound is tighter than what has been reported in the literature. Motivated by the theoretical analysis, we propose a novel objective function for estimating the ATE that uses the energy distance balancing score and hence does not require the correct specification of the propensity score model. We also leverage recently developed neural additive models to improve interpretability of deep learning models used for potential outcome prediction. We further enhance our proposed model with an energy distance balancing score weighted regularization. 
The superiority of our proposed model over current state-of-the-art methods is demonstrated in semi-synthetic experiments using two benchmark datasets, namely, IHDP and ACIC, and is further examined through a study of the effect of smoking on the blood level of cadmium using NHANES.</p>","PeriodicalId":45094,"journal":{"name":"Statistics in Biosciences","volume":"17 1","pages":"132-150"},"PeriodicalIF":0.4,"publicationDate":"2025-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11957463/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143765096","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DeepBiome: A Phylogenetic Tree Informed Deep Neural Network for Microbiome Data Analysis.","authors":"Jing Zhai, Youngwon Choi, Xingyi Yang, Yin Chen, Kenneth Knox, Homer L Twigg, Joong-Ho Won, Hua Zhou, Jin J Zhou","doi":"10.1007/s12561-024-09434-9","DOIUrl":"10.1007/s12561-024-09434-9","url":null,"abstract":"<p><p>Evidence linking the microbiome to human health is rapidly growing. The microbiome profile has potential as a novel predictive biomarker for many diseases. However, tables of bacterial counts are typically sparse, and bacteria are classified within a hierarchy of taxonomic levels, ranging from species to phylum. Existing tools focus on identifying microbiome associations at either the community level or a specific, pre-defined taxonomic level. Incorporating the evolutionary relationship between bacteria can enhance data interpretation. This approach allows for aggregating microbiome contributions, leading to more accurate and interpretable results. We present DeepBiome, a phylogeny-informed neural network architecture, to predict phenotypes from microbiome counts and uncover the microbiome-phenotype association network. It utilizes microbiome abundance as input and employs phylogenetic taxonomy to guide the neural network's architecture. Leveraging phylogenetic information, DeepBiome reduces the need for extensive tuning of the deep learning architecture, minimizes overfitting, and, crucially, enables the visualization of the path from microbiome counts to disease. It is applicable to both regression and classification problems. Simulation studies and real-life data analysis have shown that DeepBiome is both highly accurate and efficient. It offers deep insights into complex microbiome-phenotype associations, even with small to moderate training sample sizes. In practice, the specific taxonomic level at which microbiome clusters tag the association remains unknown. 
Therefore, the main advantage of the presented method over other analytical methods is that it offers an ecological and evolutionary understanding of host-microbe interactions, which is important for microbiome-based medicine. DeepBiome is implemented using Python packages Keras and TensorFlow. It is an open-source tool available at https://github.com/Young-won/DeepBiome.</p>","PeriodicalId":45094,"journal":{"name":"Statistics in Biosciences","volume":"17 1","pages":"191-215"},"PeriodicalIF":0.4,"publicationDate":"2025-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12395559/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144973306","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Novel Scalar-on-matrix Regression for Unbalanced Feature Matrices.","authors":"Jeremy Rubin, Fan Fan, Laura Barisoni, Andrew R Janowczyk, Jarcy Zee","doi":"10.1007/s12561-025-09476-7","DOIUrl":"10.1007/s12561-025-09476-7","url":null,"abstract":"<p><p>Image features that characterize tubules from digitized kidney biopsies may offer insight into disease prognosis as novel biomarkers. For each subject, we can construct a matrix whose entries are a common set of image features (e.g., area, orientation, eccentricity) that are measured for each tubule from that subject's biopsy. Previous scalar-on-matrix regression approaches, which can predict scalar outcomes using image feature matrices, cannot handle varying numbers of tubules across subjects. We propose the CLUstering Structured laSSO (CLUSSO), a novel scalar-on-matrix regression technique that allows for unbalanced numbers of tubules, to predict scalar outcomes from the image feature matrices. Through classifying tubules into one of two different clusters, CLUSSO averages and weights tubular feature values within-subject and within-cluster to create balanced feature matrices that can then be used with structured lasso regression. We develop the theoretical large tubule sample properties for the error bounds of the feature coefficient estimates. Simulation study results indicate that CLUSSO often achieves a lower false positive rate and higher true positive rate for identifying the image features which truly affect outcomes relative to a naive method that averages feature values across all tubules. Additionally, we find that CLUSSO has lower bias and can predict outcomes with accuracy competitive with the naive approach. 
Finally, we applied CLUSSO to tubular image features from kidney biopsies of glomerular disease subjects from the Nephrotic Syndrome Study Network (NEPTUNE) to predict kidney function and used subjects from the Cure Glomerulonephropathy (CureGN) study as an external validation set.</p>","PeriodicalId":45094,"journal":{"name":"Statistics in Biosciences","volume":" ","pages":""},"PeriodicalIF":0.4,"publicationDate":"2025-03-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12456458/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145138874","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}