{"title":"Quantifying Direct and Indirect Effects through Joint Modeling of Terminal Events and Gap Times between Recurrent Events.","authors":"Fang Niu, Cheng Zheng, Lei Liu","doi":"10.6339/26-jds1227","DOIUrl":"10.6339/26-jds1227","url":null,"abstract":"<p><p>Joint models can describe the relationship between recurrent and terminal events. Typically, recurrent events are modeled using the total time scale, assuming constant covariate effects on each recurrent event. However, modeling the gap time between recurrent events could allow varying covariate effects and offer greater flexibility and accuracy. For instance, in HIV-infected patients, the intervals between the first occurrence of opportunistic infections (OIs) may follow a different distribution compared to later OIs. However, limited research has focused on mediation analysis using joint modeling of gap times and survival time. In this work, we propose a novel joint modeling approach that studies the mediation effect of recurrent events on survival outcomes by modeling the recurrent events by gap time. This allows us to handle cases where the first occurrence of a recurrent event behaves differently from subsequent events. Additionally, we use a relaxed \"sequential ignorability\" assumption to address unmeasured confounding. Simulation studies demonstrate that our model performs well in estimating both model parameters and mediation effects. We apply our method to an AIDS study to evaluate the comparative effectiveness of two treatments and the effect of baseline CD4 counts on overall survival, mediated by recurrent opportunistic infections modeled through gap times.</p>","PeriodicalId":73699,"journal":{"name":"Journal of data science : JDS","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2026-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC13089381/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147724773","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Variable Selection with FDR Control for Noisy Data - An Application to Screening Metabolites that Are Associated with Breast Cancer and Colorectal Cancer.","authors":"Runqiu Wang, Ran Dai, Ying Huang, Marian L Neuhouser, Johanna W Lampe, Daniel Raftery, Fred K Tabung, Cheng Zheng","doi":"10.6339/25-jds1166","DOIUrl":"https://doi.org/10.6339/25-jds1166","url":null,"abstract":"<p><p>The rapidly expanding field of metabolomics presents an invaluable resource for understanding the associations between metabolites and various diseases. However, the high dimensionality, presence of missing values, and measurement errors associated with metabolomics data can present challenges in developing reliable and reproducible approaches for disease association studies. Therefore, there is a compelling need for robust statistical analyses that can navigate these complexities and yield reliable, reproducible results. In this paper, we construct algorithms to perform variable selection for noisy data and control the False Discovery Rate when selecting mutual metabolomic predictors for multiple disease outcomes. We illustrate the versatility and performance of this procedure in a variety of scenarios, dealing with missing data and measurement errors. As a specific application of this novel methodology, we target two of the most prevalent cancers among US women: breast cancer and colorectal cancer. By applying our method to the Women's Health Initiative data, we successfully identify metabolites that are associated with either or both of these cancers, demonstrating the practical utility and potential of our method in identifying consistent risk factors and understanding shared mechanisms between diseases.</p>","PeriodicalId":73699,"journal":{"name":"Journal of data science : JDS","volume":"23 3","pages":"499-520"},"PeriodicalIF":0.0,"publicationDate":"2025-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC13108675/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147790971","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"EMixed: Probabilistic Multi-Omics Cellular Deconvolution of Bulk Omics Data.","authors":"Manqi Cai, Kangyi Zhao, Penghui Huang, Juan C Celedón, Chris McKennan, Wei Chen, Jiebiao Wang","doi":"10.6339/25-jds1170","DOIUrl":"10.6339/25-jds1170","url":null,"abstract":"<p><p>Cellular deconvolution is a key approach to deciphering the complex cellular makeup of tissues by inferring the composition of cell types from bulk data. Traditionally, deconvolution methods have focused on a single molecular modality, relying either on RNA sequencing (RNA-seq) to capture gene expression or on DNA methylation (DNAm) to reveal epigenetic profiles. While these single-modality approaches have provided important insights, they often lack the depth needed to fully understand the intricacies of cellular compositions, especially in complex tissues. To address these limitations, we introduce EMixed, a versatile framework designed for both single-modality and multi-omics cellular deconvolution. EMixed models raw RNA counts and DNAm counts or frequencies via allocation models that assign RNA transcripts and DNAm reads to cell types, and uses an expectation-maximization (EM) algorithm to estimate parameters. Benchmarking results demonstrate that EMixed significantly outperforms existing methods across both single-modality and multi-modality applications, underscoring the broad utility of this approach in enhancing our understanding of cellular heterogeneity.</p>","PeriodicalId":73699,"journal":{"name":"Journal of data science : JDS","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2025-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12530062/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145330957","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Innovative Method of Singular Spectrum Analysis to Conduct Gap-filling and Denoising on Time Series Data.","authors":"James J Yang, Anne Buu","doi":"10.6339/25-jds1164","DOIUrl":"10.6339/25-jds1164","url":null,"abstract":"<p><p>Heart rate data collected from wearable devices - one type of time series data - could provide insights into activities, stress levels, and health. Yet, consecutive missing segments (i.e., gaps) that commonly occur due to improper device placement or device malfunction could distort the temporal patterns inherent in the data and undermine the validity of downstream analyses. This study proposes an innovative iterative procedure to fill gaps in time series data that capitalizes on the denoising capability of Singular Spectrum Analysis (SSA) and eliminates SSA's requirement of pre-specifying the window length and number of groups. The results of simulations demonstrate that the performance of SSA-based gap-filling methods depends on the choice of window length, number of groups, and the percentage of missing values. In contrast, the proposed method consistently achieves the lowest rates of reconstruction error and gap-filling error across a variety of combinations of the factors manipulated in the simulations. The simulation findings also highlight that the commonly recommended long window length - half of the time series length - may not apply to time series with varying frequencies such as heart rate data. The initialization step of the proposed method that involves a large window length and the first four singular values in the iterative singular value decomposition process not only avoids convergence issues but also facilitates imputation accuracy in subsequent iterations. The proposed method provides the flexibility for researchers to conduct gap-filling solely or in combination with denoising on time series data and thus widens the applications.</p>","PeriodicalId":73699,"journal":{"name":"Journal of data science : JDS","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2025-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12439824/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145082677","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Neural Network for Correlated Survival Outcomes Using Frailty Model.","authors":"Ruiwen Zhou, Kevin He, Di Wang, Lili Liu, Shujie Ma, Annie Qu, J Philip Miller, Lei Liu","doi":"10.6339/25-jds1173","DOIUrl":"10.6339/25-jds1173","url":null,"abstract":"<p><p>An extensive literature has been developed for the analysis of correlated survival data. Subjects within a cluster share some common characteristics, e.g., genetic and environmental factors, so their time-to-event outcomes are correlated. The frailty model under the proportional hazards assumption has been widely applied for the analysis of clustered survival outcomes. However, the prediction performance of this method can be less satisfactory when the risk factors have complicated effects, e.g., nonlinear and interactive. To deal with these issues, we propose a neural network frailty Cox model that replaces the linear risk function with the output of a feed-forward neural network. The estimation is based on quasi-likelihood using Laplace approximation. A simulation study suggests that the proposed method has the best performance compared with existing methods. The method is applied to the clustered time-to-failure prediction within the kidney transplantation facility using the national kidney transplant registry data from the U.S. Organ Procurement and Transplantation Network. All computer programs are available at https://github.com/rivenzhou/deep_learning_clustered.</p>","PeriodicalId":73699,"journal":{"name":"Journal of data science : JDS","volume":"23 4","pages":"624-637"},"PeriodicalIF":0.0,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12829921/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146055270","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Magnitude Pruning of Large Pretrained Transformer Models with a Mixture Gaussian Prior.","authors":"Mingxuan Zhang, Yan Sun, Faming Liang","doi":"10.6339/24-jds1156","DOIUrl":"10.6339/24-jds1156","url":null,"abstract":"<p><p>Large pretrained transformer models have revolutionized modern AI applications with their state-of-the-art performance in natural language processing (NLP). However, their substantial parameter count poses challenges for real-world deployment. To address this, researchers often reduce model size by pruning parameters based on their magnitude or sensitivity. Previous research has demonstrated the limitations of magnitude pruning, especially in the context of transfer learning for modern NLP tasks. In this paper, we introduce a new magnitude-based pruning algorithm called mixture Gaussian prior pruning (MGPP), which employs a mixture Gaussian prior for regularization. MGPP prunes non-expressive weights under the guidance of the mixture Gaussian prior, aiming to retain the model's expressive capability. Extensive evaluations across various NLP tasks, including natural language understanding, question answering, and natural language generation, demonstrate the superiority of MGPP over existing pruning methods, particularly in high sparsity settings. Additionally, we provide a theoretical justification for the consistency of the sparse transformer, shedding light on the effectiveness of the proposed pruning method.</p>","PeriodicalId":73699,"journal":{"name":"Journal of data science : JDS","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-11-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12629628/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145566680","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Meta-Learner Framework to Estimate Individualized Treatment Effects for Survival Outcomes.","authors":"Na Bo, Yue Wei, Lang Zeng, Chaeryon Kang, Ying Ding","doi":"10.6339/24-jds1119","DOIUrl":"10.6339/24-jds1119","url":null,"abstract":"<p><p>One crucial aspect of precision medicine is to allow physicians to recommend the most suitable treatment for their patients. This requires understanding the treatment heterogeneity from a patient-centric view, quantified by estimating the individualized treatment effect (ITE). With a large amount of genetics data and medical factors being collected, a complete picture of individuals' characteristics is forming, which provides more opportunities to accurately estimate ITE. Recent development using machine learning methods within the counterfactual outcome framework shows excellent potential in analyzing such data. In this research, we propose to extend meta-learning approaches to estimate individualized treatment effects with survival outcomes. Two meta-learning algorithms are considered, T-learner and X-learner, each combined with three types of machine learning methods: random survival forest, Bayesian accelerated failure time model and survival neural network. We examine the performance of the proposed methods and provide practical guidelines for their application in randomized clinical trials (RCTs). Moreover, we propose to use the Boruta algorithm to identify risk factors that contribute to treatment heterogeneity based on ITE estimates. The finite sample performances of these methods are compared through extensive simulations under different randomization designs. The proposed approach is applied to a large RCT of eye disease, namely, age-related macular degeneration (AMD), to estimate the ITE on delaying time-to-AMD progression and to make individualized treatment recommendations.</p>","PeriodicalId":73699,"journal":{"name":"Journal of data science : JDS","volume":"22 4","pages":"505-523"},"PeriodicalIF":0.0,"publicationDate":"2024-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12440118/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145082744","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Maximum Likelihood Estimation for Shape-restricted Single-index Hazard Models.","authors":"Jing Qin, Yifei Sun, Ao Yuan, Chiung-Yu Huang","doi":"10.6339/22-jds1061","DOIUrl":"10.6339/22-jds1061","url":null,"abstract":"<p><p>Single-index models are becoming increasingly popular in many scientific applications as they offer the advantages of flexibility in regression modeling as well as interpretable covariate effects. In the context of survival analysis, the single-index hazards models are natural extensions of the Cox proportional hazards models. In this paper, we propose a novel estimation procedure for single-index hazard models under a monotone constraint of the index. We apply the profile likelihood method to obtain the semiparametric maximum likelihood estimator, where the novelty of the estimation procedure lies in estimating the unknown monotone link function by embedding the problem in isotonic regression with exponentially distributed random variables. The consistency of the proposed semiparametric maximum likelihood estimator is established under suitable regularity conditions. Numerical simulations are conducted to examine the finite-sample performance of the proposed method. An analysis of breast cancer data is presented for illustration.</p>","PeriodicalId":73699,"journal":{"name":"Journal of data science : JDS","volume":"1 1","pages":"681-695"},"PeriodicalIF":0.0,"publicationDate":"2023-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11017303/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"71320541","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Central Posterior Envelopes for Bayesian Functional Principal Component Analysis.","authors":"Joanna Boland, Donatello Telesca, Catherine Sugar, Shafali Jeste, Abigail Dickinson, Charlotte DiStefano, Damla Şentürk","doi":"10.6339/23-jds1085","DOIUrl":"10.6339/23-jds1085","url":null,"abstract":"<p><p>Bayesian methods provide direct inference in functional data analysis applications without reliance on bootstrap techniques. A major tool in functional data applications is the functional principal component analysis which decomposes the data around a common mean function and identifies leading directions of variation. Bayesian functional principal components analysis (BFPCA) provides uncertainty quantification on the estimated functional model components via the posterior samples obtained. We propose central posterior envelopes (CPEs) for BFPCA based on functional depth as a descriptive visualization tool to summarize variation in the posterior samples of the estimated functional model components, contributing to uncertainty quantification in BFPCA. The proposed BFPCA relies on a latent factor model and targets model parameters within a mixed effects modeling framework using modified multiplicative gamma process shrinkage priors on the variance components. Functional depth provides a center-outward order to a sample of functions. We utilize modified band depth and modified volume depth for ordering of a sample of functions and surfaces, respectively, to derive CPEs of the mean and eigenfunctions within the BFPCA framework. The proposed CPEs are showcased in extensive simulations. Finally, the proposed CPEs are applied to the analysis of a sample of power spectral densities (PSD) from resting state electroencephalography (EEG) where they lead to novel insights on diagnostic group differences among children diagnosed with autism spectrum disorder and their typically developing peers across age.</p>","PeriodicalId":73699,"journal":{"name":"Journal of data science : JDS","volume":"1 1","pages":"715-734"},"PeriodicalIF":0.0,"publicationDate":"2023-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11178334/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"71320653","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimal Physician Shared-Patient Networks and the Diffusion of Medical Technologies.","authors":"A James O'Malley, Xin Ran, Chuankai An, Daniel Rockmore","doi":"10.6339/22-jds1064","DOIUrl":"10.6339/22-jds1064","url":null,"abstract":"<p><p>Social network analysis has created a productive framework for the analysis of the histories of patient-physician interactions and physician collaboration. Notable is the construction of networks based on the data of \"referral paths\" - sequences of patient-specific temporally linked physician visits - in this case, culled from a large set of Medicare claims data in the United States. Network constructions depend on a range of choices regarding the underlying data. In this paper we introduce the use of a five-factor experiment that produces 80 distinct projections of the bipartite patient-physician mixing matrix to a unipartite physician network derived from the referral path data, which is further analyzed at the level of the 2,219 hospitals in the final analytic sample. We summarize the networks of physicians within a given hospital using a range of directed and undirected network features (quantities that summarize structural properties of the network such as its size, density, and reciprocity). The different projections and their underlying factors are evaluated in terms of the heterogeneity of the network features across the hospitals. We also evaluate the projections relative to their ability to improve the predictive accuracy of a model estimating a hospital's adoption of implantable cardiac defibrillators, a novel cardiac intervention. Because it optimizes the knowledge learned about the overall and interactive effects of the factors, we anticipate that the factorial design setting for network analysis may be useful more generally as a methodological advance in network analysis.</p>","PeriodicalId":73699,"journal":{"name":"Journal of data science : JDS","volume":"1 1","pages":"578-598"},"PeriodicalIF":0.0,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10956597/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"71320639","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}