{"title":"Selection of number of clusters and warping penalty in clustering functional electrocardiogram.","authors":"Wei Yang, Harold I Feldman, Wensheng Guo","doi":"10.1002/sim.10192","DOIUrl":"10.1002/sim.10192","url":null,"abstract":"<p><p>Clustering functional data aims to identify unique functional patterns in the entire domain, but this can be challenging due to phase variability that distorts the observed patterns. Curve registration can be used to remove this variability, but determining the appropriate level of warping flexibility can be complicated. Curve registration also requires a target to which a functional object is aligned, typically the cross-sectional mean of functional objects within the same cluster. However, this mean is unknown prior to clustering. Furthermore, there is a trade-off between flexible warping and the number of resulting clusters. Removing more phase variability through curve registration can lead to fewer remaining variations in the functional data, resulting in a smaller number of clusters. Thus, the optimal number of clusters and warping flexibility cannot be uniquely identified. We propose to use external information to solve the identification issue. We define a cross-validated Kullback-Leibler information criterion to select the number of clusters and the warping penalty. The criterion is derived from the predictive classification likelihood considering the joint distribution of both the functional data and the external variable and penalizes the uncertainty in the cluster membership. We evaluate our method through simulation and apply it to electrocardiographic data collected in the Chronic Renal Insufficiency Cohort study. We identify two distinct clusters of electrocardiogram (ECG) profiles, with the second cluster exhibiting ST segment depression, an indication of cardiac ischemia, compared to the normal ECG profiles in the first cluster.</p>","PeriodicalId":21879,"journal":{"name":"Statistics in Medicine","volume":null,"pages":null},"PeriodicalIF":1.8,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142154970","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
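Editor's note: the warping-flexibility trade-off described in this abstract can be made concrete with a toy example. The sketch below is not the authors' estimator (their registration is defined jointly with the clustering likelihood); it is plain dynamic time warping with a hypothetical additive penalty `lam` charged to every off-diagonal (warping) step, so a larger `lam` forces a curve to keep closer to its observed timing.

```python
# Toy illustration (not the paper's method): cost of aligning curve x to a
# target under dynamic time warping, with penalty lam on each warping step.
def dtw_align_cost(x, target, lam):
    n, m = len(x), len(target)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dist = (x[i - 1] - target[j - 1]) ** 2
            D[i][j] = dist + min(
                D[i - 1][j - 1],        # diagonal step: no warping, no penalty
                D[i - 1][j] + lam,      # compress time: penalized
                D[i][j - 1] + lam,      # stretch time: penalized
            )
    return D[n][m]
```

With `lam = 0` the curves `[0, 0, 1]` and `[0, 1, 1]` can be warped onto each other at zero cost; with a large `lam` the same pair keeps a residual (amplitude) difference, which is exactly the identifiability tension the abstract describes.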
{"title":"High-Dimensional Overdispersed Generalized Factor Model With Application to Single-Cell Sequencing Data Analysis.","authors":"Jinyu Nie, Zhilong Qin, Wei Liu","doi":"10.1002/sim.10213","DOIUrl":"https://doi.org/10.1002/sim.10213","url":null,"abstract":"<p><p>The current high-dimensional linear factor models fail to account for the different types of variables, while high-dimensional nonlinear factor models often overlook the overdispersion present in mixed-type data. However, overdispersion is prevalent in practical applications, particularly in fields like biomedical and genomics studies. To address this practical demand, we propose an overdispersed generalized factor model (OverGFM) for performing high-dimensional nonlinear factor analysis on overdispersed mixed-type data. Our approach incorporates an additional error term to capture the overdispersion that cannot be accounted for by factors alone. However, this introduces significant computational challenges due to the involvement of two high-dimensional latent random matrices in the nonlinear model. To overcome these challenges, we propose a novel variational EM algorithm that integrates Laplace and Taylor approximations. This algorithm provides iterative explicit solutions for the complex variational parameters and is proven to possess excellent convergence properties. We also develop a criterion based on the singular value ratio to determine the optimal number of factors. Numerical results demonstrate the effectiveness of this criterion. Through comprehensive simulation studies, we show that OverGFM outperforms state-of-the-art methods in terms of estimation accuracy and computational efficiency. Furthermore, we demonstrate the practical merit of our method through its application to two datasets from genomics. To facilitate its usage, we have integrated the implementation of OverGFM into the R package GFM.</p>","PeriodicalId":21879,"journal":{"name":"Statistics in Medicine","volume":null,"pages":null},"PeriodicalIF":1.8,"publicationDate":"2024-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142141100","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
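Editor's note: a singular-value-ratio criterion for choosing the number of factors can be sketched generically. This is a simplified stand-in, not the OverGFM criterion itself (which is applied within the model fit); the function name and the `q_max` default are illustrative.

```python
import numpy as np

# Generic singular-value-ratio (SVR) idea: choose the q that maximizes
# sigma_q / sigma_{q+1} among the leading singular values of the
# column-centered data matrix. A large drop marks the signal/noise boundary.
def select_num_factors(X, q_max=10):
    s = np.linalg.svd(X - X.mean(axis=0), compute_uv=False)
    ratios = s[:q_max] / s[1:q_max + 1]
    return int(np.argmax(ratios)) + 1    # 1-based number of factors
```

On a matrix generated from three latent factors plus small noise, the ratio spikes between the third and fourth singular values, recovering the true dimension.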
{"title":"Assessing the Performance of Machine Learning Methods Trained on Public Health Observational Data: A Case Study From COVID-19.","authors":"Davide Pigoli, Kieran Baker, Jobie Budd, Lorraine Butler, Harry Coppock, Sabrina Egglestone, Steven G Gilmour, Chris Holmes, David Hurley, Radka Jersakova, Ivan Kiskin, Vasiliki Koutra, Jonathon Mellor, George Nicholson, Joe Packham, Selina Patel, Richard Payne, Stephen J Roberts, Björn W Schuller, Ana Tendero-Cañadas, Tracey Thornley, Alexander Titcomb","doi":"10.1002/sim.10211","DOIUrl":"https://doi.org/10.1002/sim.10211","url":null,"abstract":"<p><p>From early in the coronavirus disease 2019 (COVID-19) pandemic, there was interest in using machine learning methods to predict COVID-19 infection status based on vocal audio signals, for example, cough recordings. However, early studies had limitations in terms of data collection and of how the performances of the proposed predictive models were assessed. This article describes how these limitations have been overcome in a study carried out by the Turing-RSS Health Data Laboratory and the UK Health Security Agency. As part of the study, the UK Health Security Agency collected a dataset of acoustic recordings, SARS-CoV-2 infection status and extensive study participant meta-data. This allowed us to rigorously assess state-of-the-art machine learning techniques to predict SARS-CoV-2 infection status based on vocal audio signals. The lessons learned from this project should inform future studies on statistical evaluation methods to assess the performance of machine learning techniques for public health tasks.</p>","PeriodicalId":21879,"journal":{"name":"Statistics in Medicine","volume":null,"pages":null},"PeriodicalIF":1.8,"publicationDate":"2024-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142141160","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Causal Mediation Approach to Account for Interaction of Treatment and Intercurrent Events: Using Hypothetical Strategy.","authors":"Kunpeng Wu, Xiangliang Zhang, Meng Zheng, Jianghui Zhang, Wen Chen","doi":"10.1002/sim.10212","DOIUrl":"https://doi.org/10.1002/sim.10212","url":null,"abstract":"<p><p>The hypothetical strategy is a common approach for handling intercurrent events (IEs). No current guideline or study considers treatment-IE interaction to target the estimand in any one IE-handling strategy. Based on the hypothetical strategy, we aimed to (1) assess the performance of three estimators with different considerations for the treatment-IE interaction in a simulation and (2) compare the estimation of these estimators in a real trial. Simulation data were generated based on realistic clinical trials of Alzheimer's disease. The estimand of interest was the effect of treatment with no IE occurring under the hypothetical strategy. Three estimators, namely, G-estimation with and without interaction and IE-ignored estimation, were compared in scenarios where the treatment-IE interaction effect was set as -50% to 50% of the main effect. Bias was the key performance measure. The real case was derived from a randomized trial of methadone maintenance treatment. Only G-estimation with interaction exhibited unbiased estimations regardless of the existence, direction or magnitude of the treatment-IE interaction in those scenarios. Neglecting the interaction and ignoring the IE would introduce a bias as large as 0.093 and 0.241 (true value, -1.561) if the interaction effect existed. In the real case, compared with G-estimation with interaction, G-estimation without interaction and IE-ignored estimation increased the estimand of interest by 33.55% and 34.36%, respectively. This study highlights the importance of considering treatment-IE interaction in the estimand framework. In practice, it would be better to include the interaction in the estimator by default.</p>","PeriodicalId":21879,"journal":{"name":"Statistics in Medicine","volume":null,"pages":null},"PeriodicalIF":1.8,"publicationDate":"2024-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142141159","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
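Editor's note: the cost of ignoring a treatment-IE interaction can be illustrated outside the G-estimation machinery with a stripped-down linear model. When the interaction coefficient is nonzero and the IE occurs more often in one arm, omitting the interaction term biases the coefficient that represents the treatment effect under no IE. All variable names and coefficient values below are hypothetical, chosen only for the illustration.

```python
import numpy as np

# Hypothetical illustration (not G-estimation): under a linear model with a
# treat*IE interaction, the "no-IE" treatment effect is the coefficient on
# treat; dropping the interaction term biases that coefficient.
rng = np.random.default_rng(0)
n = 5000
treat = rng.integers(0, 2, n)
ie = (rng.random(n) < 0.2 + 0.4 * treat).astype(float)   # IE more likely if treated
y = 1.0 * treat + 0.5 * ie - 1.0 * treat * ie + rng.normal(0, 0.5, n)

X_full = np.column_stack([np.ones(n), treat, ie, treat * ie])
X_noint = X_full[:, :3]                                  # interaction omitted
b_full = np.linalg.lstsq(X_full, y, rcond=None)[0]
b_noint = np.linalg.lstsq(X_noint, y, rcond=None)[0]
# b_full[1] recovers the true no-IE effect (1.0); b_noint[1] is pulled away
# from it because the omitted treat*ie term is correlated with treat.
```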
{"title":"Approximate maximum likelihood estimation in cure models using aggregated data, with application to HPV vaccine completion.","authors":"John D Rice, Allison Kempe","doi":"10.1002/sim.10174","DOIUrl":"10.1002/sim.10174","url":null,"abstract":"<p><p>Research into vaccine hesitancy is a critical component of the public health enterprise, as rates of communicable diseases preventable by routine childhood immunization have been increasing in recent years. It is therefore important to estimate proportions of \"never-vaccinators\" in various subgroups of the population in order to successfully target interventions to improve childhood vaccination rates. However, due to privacy issues, it may be difficult to obtain individual patient data (IPD) needed to perform the appropriate time-to-event analyses: state-level immunization information services may only be willing to share aggregated data with researchers. We propose statistical methodology for the analysis of aggregated survival data that can accommodate a cured fraction based on a polynomial approximation of the mixture cure model log-likelihood function relying only on summary statistics. We study the performance of the method through simulation studies and apply it to a real-world data set from a study examining reminder/recall approaches to improve human papillomavirus (HPV) vaccination uptake. The proposed methods may be generalized for use when there is interest in fitting complex likelihood-based models but IPD is unavailable due to data privacy or other concerns.</p>","PeriodicalId":21879,"journal":{"name":"Statistics in Medicine","volume":null,"pages":null},"PeriodicalIF":1.8,"publicationDate":"2024-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142133804","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
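Editor's note: the core trick here, replacing an individual-level log-likelihood with a polynomial so that only summary statistics are needed, can be sketched for a generic per-subject term g(x; θ). A second-order Taylor expansion around the sample mean needs only n, Σx and Σx². The paper's approximation for the mixture cure model is more involved; the sketch below (with an arbitrary example g) shows only the idea.

```python
import math

# Approximate sum_i g(x_i; theta) using only n, sum(x), sum(x^2), via a
# second-order Taylor expansion of g around the sample mean. Illustrative
# only; the paper applies a polynomial approximation to the mixture cure
# model log-likelihood, not to this toy g.
def g(x, theta):                                 # example per-subject term
    return math.log(1.0 + math.exp(theta * x))

def approx_sum_g(theta, n, sx, sxx):
    xbar = sx / n
    sig = 1.0 / (1.0 + math.exp(-theta * xbar))  # logistic function at xbar
    g0 = math.log(1.0 + math.exp(theta * xbar))
    g2 = theta ** 2 * sig * (1.0 - sig)          # g''(xbar)
    css = sxx - sx ** 2 / n                      # centered sum of squares
    return n * g0 + 0.5 * g2 * css               # first-order term vanishes at xbar
```

A data holder could thus release (n, Σx, Σx²) instead of the raw x_i, and the analyst would still recover the log-likelihood to good accuracy when the x_i are not too dispersed.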
{"title":"Performance of mixed effects models and generalized estimating equations for continuous outcomes in partially clustered trials including both independent and paired data.","authors":"Kylie M Lange, Thomas R Sullivan, Jessica Kasza, Lisa N Yelland","doi":"10.1002/sim.10201","DOIUrl":"https://doi.org/10.1002/sim.10201","url":null,"abstract":"<p><p>Many clinical trials involve partially clustered data, where some observations belong to a cluster and others can be considered independent. For example, neonatal trials may include infants from single or multiple births. Sample size and analysis methods for these trials have received limited attention. A simulation study was conducted to (1) assess whether existing power formulas based on generalized estimating equations (GEEs) provide an adequate approximation to the power achieved by mixed effects models, and (2) compare the performance of mixed models vs GEEs in estimating the effect of treatment on a continuous outcome. We considered clusters that exist prior to randomization with a maximum cluster size of 2, three methods of randomizing the clustered observations, and simulated datasets with uninformative cluster size and the sample size required to achieve 80% power according to GEE-based formulas with an independence or exchangeable working correlation structure. The empirical power of the mixed model approach was close to the nominal level when sample size was calculated using the exchangeable GEE formula, but was often too high when the sample size was based on the independence GEE formula. The independence GEE always converged and performed well in all scenarios. Performance of the exchangeable GEE and mixed model was also acceptable under cluster randomization, though under-coverage and inflated type I error rates could occur with other methods of randomization. Analysis of partially clustered trials using GEEs with an independence working correlation structure may be preferred to avoid the limitations of mixed models and exchangeable GEEs.</p>","PeriodicalId":21879,"journal":{"name":"Statistics in Medicine","volume":null,"pages":null},"PeriodicalIF":1.8,"publicationDate":"2024-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142133805","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
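Editor's note: for a continuous outcome with an identity link, a GEE with an independence working correlation gives OLS point estimates, with clustering entering only through a cluster-robust (sandwich) variance. The generic numpy sketch below shows that estimator; it is not the simulation code from the paper, and the names are illustrative.

```python
import numpy as np

# Independence-working-correlation GEE for a continuous outcome:
# OLS point estimates + cluster-robust sandwich standard errors.
def gee_independence(y, X, cluster):
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    bread = np.linalg.inv(X.T @ X)
    meat = np.zeros((X.shape[1], X.shape[1]))
    for c in np.unique(cluster):
        idx = cluster == c
        g = X[idx].T @ resid[idx]     # cluster-level score contribution
        meat += np.outer(g, g)
    cov = bread @ meat @ bread        # sandwich: bread * meat * bread
    return beta, np.sqrt(np.diag(cov))
```

Because independent observations are simply clusters of size 1, the same formula covers the partially clustered case (a mix of singletons and pairs) without any special handling.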
{"title":"Evaluating analytic models for individually randomized group treatment trials with complex clustering in nested and crossed designs.","authors":"Jonathan C Moyer, Fan Li, Andrea J Cook, Patrick J Heagerty, Sherri L Pals, Elizabeth L Turner, Rui Wang, Yunji Zhou, Qilu Yu, Xueqi Wang, David M Murray","doi":"10.1002/sim.10206","DOIUrl":"https://doi.org/10.1002/sim.10206","url":null,"abstract":"<p><p>Many individually randomized group treatment (IRGT) trials randomly assign individuals to study arms but deliver treatments via shared agents, such as therapists, surgeons, or trainers. Post-randomization interactions induce correlations in outcome measures between participants sharing the same agent. Agents can be nested in or crossed with trial arm, and participants may interact with a single agent or with multiple agents. These complications have led to ambiguity in the choice of models, but there have been no systematic efforts to identify appropriate analytic models for these study designs. To address this gap, we undertook a simulation study to examine the performance of candidate analytic models in the presence of complex clustering arising from multiple membership, single membership, and single agent settings, in both nested and crossed designs and for a continuous outcome. With nested designs, substantial type I error rate inflation was observed when analytic models did not account for multiple membership and when analytic model weights characterizing the association with multiple agents did not match the data generating mechanism. Conversely, analytic models for crossed designs generally maintained nominal type I error rates unless there was notable imbalance in the number of participants that interact with each agent.</p>","PeriodicalId":21879,"journal":{"name":"Statistics in Medicine","volume":null,"pages":null},"PeriodicalIF":1.8,"publicationDate":"2024-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142120561","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multilevel Longitudinal Functional Principal Component Model.","authors":"Wenyi Lin, Jingjing Zou, Chongzhi Di, Cheryl L Rock, Loki Natarajan","doi":"10.1002/sim.10207","DOIUrl":"https://doi.org/10.1002/sim.10207","url":null,"abstract":"<p><p>Sensor devices, such as accelerometers, are widely used for measuring physical activity (PA). These devices provide outputs at fine granularity (e.g., 10-100 Hz or minute-level), which, while providing rich data on activity patterns, also pose computational challenges with multilevel densely sampled data, resulting in PA records that are measured continuously across multiple days and visits. On the other hand, a scalar health outcome (e.g., BMI) is usually observed only at the individual or visit level. This leads to a discrepancy in the number of nested levels between the predictors (PA) and outcomes, raising analytic challenges. To address this issue, we proposed a multilevel longitudinal functional principal component analysis (mLFPCA) model to directly model multilevel functional PA inputs in a longitudinal study, and then implemented a longitudinal functional principal component regression (FPCR) to explore the association between PA and obesity-related health outcomes. Additionally, we conducted a comprehensive simulation study to examine the impact of imbalanced multilevel data on both mLFPCA and FPCR performance and offer guidelines for selecting optimal methods.</p>","PeriodicalId":21879,"journal":{"name":"Statistics in Medicine","volume":null,"pages":null},"PeriodicalIF":1.8,"publicationDate":"2024-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142126751","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
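Editor's note: the basic building block of this model, functional PCA on a grid, is a standard SVD of the centered data matrix (curves in rows, grid points in columns). The paper's mLFPCA decomposes variation across multiple nested levels and over visits; the sketch below shows only the single-level version, with illustrative names.

```python
import numpy as np

# Standard single-level FPCA via SVD of the centered curve matrix.
# Rows are curves (e.g., one day of activity); columns are grid points.
def fpca(curves, n_components):
    mu = curves.mean(axis=0)                          # mean function
    U, s, Vt = np.linalg.svd(curves - mu, full_matrices=False)
    eigenfunctions = Vt[:n_components]                # principal modes of variation
    scores = U[:, :n_components] * s[:n_components]   # per-curve scores
    var_explained = s[:n_components] ** 2 / np.sum(s ** 2)
    return mu, eigenfunctions, scores, var_explained
```

The per-curve scores are the low-dimensional summaries that a functional principal component regression (FPCR) then relates to a scalar outcome such as BMI.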
{"title":"Anomaly Detection and Correction in Dense Functional Data Within Electronic Medical Records.","authors":"Daren Kuwaye, Hyunkeun Ryan Cho","doi":"10.1002/sim.10209","DOIUrl":"https://doi.org/10.1002/sim.10209","url":null,"abstract":"<p><p>In medical research, the accuracy of data from electronic medical records (EMRs) is critical, particularly when analyzing dense functional data, where anomalies can severely compromise research integrity. Anomalies in EMRs often arise from human errors in data measurement and entry, and increase in frequency with the volume of data. Despite the established methods in computer science, anomaly detection in medical applications remains underdeveloped. We address this deficiency by introducing a novel tool for identifying and correcting anomalies specifically in dense functional EMR data. Our approach utilizes studentized residuals from a mean-shift model, and therefore assumes that the data adheres to a smooth functional trajectory. Additionally, our method is tailored to be conservative, focusing on anomalies that signify actual errors in the data collection process while controlling for false discovery rates and type II errors. To support widespread implementation, we provide a comprehensive R package, ensuring that our methods can be applied in diverse settings. Our methodology's efficacy has been validated through rigorous simulation studies and real-world applications, confirming its ability to accurately identify and correct errors, thus enhancing the reliability and quality of medical data analysis.</p>","PeriodicalId":21879,"journal":{"name":"Statistics in Medicine","volume":null,"pages":null},"PeriodicalIF":1.8,"publicationDate":"2024-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142120560","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
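Editor's note: a heavily simplified version of the idea, fit a smooth trajectory, standardize the residuals, and flag points with implausibly large residuals, can be sketched in a few lines. The paper's procedure studentizes under a mean-shift model and controls the false discovery rate; the sketch below uses a moving median and a crude overall standardization instead, for illustration only.

```python
import numpy as np

# Toy anomaly flagger for a dense functional record: residuals from a
# moving-median smooth, standardized by their overall standard deviation.
# Not the paper's mean-shift/FDR procedure; names and defaults are illustrative.
def flag_anomalies(y, window=5, threshold=4.0):
    y = np.asarray(y, dtype=float)
    half = window // 2
    smooth = np.array([np.median(y[max(0, i - half):i + half + 1])
                       for i in range(len(y))])       # robust local trend
    resid = y - smooth
    z = resid / resid.std(ddof=1)                     # crude standardization
    return np.where(np.abs(z) > threshold)[0]         # indices of flagged points
```

A planted data-entry spike on an otherwise smooth trajectory is flagged, while ordinary measurement noise is not; a conservative threshold keeps the procedure focused on genuine errors, mirroring the design goal stated in the abstract.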
{"title":"Data fusion for predicting long-term program impacts.","authors":"Michael W Robbins, Sebastian Bauhoff, Lane Burgette","doi":"10.1002/sim.10147","DOIUrl":"10.1002/sim.10147","url":null,"abstract":"<p><p>Policymakers often require information on programs' long-term impacts that is not available when decisions are made. For example, while rigorous evidence from the Oregon Health Insurance Experiment (OHIE) shows that having health insurance influences short-term health and financial measures, the impact on long-term outcomes, such as mortality, will not be known for many years following the program's implementation. We demonstrate how data fusion methods may be used to address the problem of missing final outcomes and predict long-run impacts of interventions before the requisite data are available. We implement this method by concatenating data on an intervention (such as the OHIE) with auxiliary long-term data and then imputing missing long-term outcomes using short-term surrogate outcomes while approximating uncertainty with replication methods. We use simulations to examine the performance of the methodology and apply the method in a case study. Specifically, we fuse data on the OHIE with data from the National Longitudinal Mortality Study and estimate that being eligible to apply for subsidized health insurance will lead to a statistically significant improvement in long-term mortality.</p>","PeriodicalId":21879,"journal":{"name":"Statistics in Medicine","volume":null,"pages":null},"PeriodicalIF":1.8,"publicationDate":"2024-08-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141421005","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
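Editor's note: the fusion step, learning the surrogate-to-outcome link from an auxiliary dataset and using it to impute the missing long-term outcome in the trial, can be reduced to a toy two-dataset example. The paper's implementation uses imputation with replication-based uncertainty; the sketch below keeps only the imputation idea, and every variable name and coefficient is hypothetical.

```python
import numpy as np

# Toy data fusion: the trial observes (treat, surrogate) but not the
# long-term outcome; the auxiliary data observe (surrogate, outcome).
# Impute the trial outcome from the auxiliary surrogate->outcome fit.
def fuse_and_estimate(treat, s_trial, s_aux, m_aux):
    slope, intercept = np.polyfit(s_aux, m_aux, 1)   # surrogate -> outcome link
    m_hat = intercept + slope * s_trial              # imputed long-term outcome
    return m_hat[treat == 1].mean() - m_hat[treat == 0].mean()
```

If treatment shifts the surrogate by 0.5 and the surrogate-outcome slope is 2, the implied long-term effect is 1.0, which the fused estimator recovers; in practice the uncertainty from both steps must also be propagated, as the paper does with replication methods.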