Combining propensity score methods with variational autoencoders for generating synthetic data in presence of latent sub-groups
Authors: Kiana Farhadyar, Federico Bonofiglio, Maren Hackenberg, Max Behrens, Daniela Zöller, Harald Binder
DOI: 10.1186/s12874-024-02327-x | BMC Medical Research Methodology | Published 2024-09-09
Abstract: In settings requiring synthetic data generation based on a clinical cohort, e.g., due to data protection regulations, heterogeneity across individuals might be a nuisance that we need to control or faithfully preserve. The sources of such heterogeneity might be known, e.g., as indicated by sub-group labels, or might be unknown and thus reflected only in properties of distributions, such as bimodality or skewness. We investigate how such heterogeneity can be preserved and controlled when obtaining synthetic data from variational autoencoders (VAEs), i.e., a generative deep learning technique that utilizes a low-dimensional latent representation. To faithfully reproduce unknown heterogeneity reflected in marginal distributions, we propose to combine VAEs with pre-transformations. For dealing with known heterogeneity due to sub-groups, we complement VAEs with models for group membership, specifically from propensity score regression. The evaluation is performed with a realistic simulation design that features sub-groups and challenging marginal distributions. The proposed approach faithfully recovers the latter, compared to synthetic data approaches that focus purely on marginal distributions. Propensity scores add complementary information, e.g., when visualized in the latent space, and enable sampling of synthetic data with or without sub-group-specific characteristics. We also illustrate the proposed approach with real data from an international stroke trial that exhibits considerable distribution differences between study sites, in addition to bimodality. These results indicate that describing heterogeneity by statistical approaches, such as propensity score regression, might be more generally useful for complementing generative deep learning for obtaining synthetic data that faithfully reflects structure from clinical cohorts.
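The combination described in the abstract can be illustrated with a compact, hypothetical sketch: a rank-based pre-transformation to tame skewed and bimodal marginals, a stand-in generator fitted on the transformed scale (where the paper uses a VAE), and a logistic propensity score model for sub-group membership used to draw synthetic samples with or without sub-group characteristics. All variable names, distributions, and the multivariate-normal generator are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' implementation): pre-transformation + stand-in
# generator + propensity score model for a known sub-group.
import numpy as np
from scipy import stats
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# toy cohort: one skewed and one bimodal variable, plus a known sub-group label
n = 2000
group = rng.binomial(1, 0.4, n)
x1 = rng.gamma(2.0, 2.0, n) + 1.5 * group               # skewed, shifted by sub-group
x2 = rng.normal(loc=np.where(group == 1, 3.0, -1.0))    # bimodal mixture
X = np.column_stack([x1, x2])

# pre-transformation: map each marginal to approximately standard normal via ranks
def to_gauss(col):
    ranks = stats.rankdata(col) / (len(col) + 1)
    return stats.norm.ppf(ranks)

Z = np.column_stack([to_gauss(X[:, j]) for j in range(X.shape[1])])

# stand-in generator fit on the transformed scale (a VAE would be used in the paper)
mean, cov = Z.mean(axis=0), np.cov(Z, rowvar=False)
Z_syn = rng.multivariate_normal(mean, cov, size=n)

# back-transform via the empirical quantiles of the observed data
def from_gauss(z_col, ref_col):
    u = stats.norm.cdf(z_col)
    return np.quantile(ref_col, u)

X_syn = np.column_stack([from_gauss(Z_syn[:, j], X[:, j]) for j in range(X.shape[1])])

# propensity score model for the known sub-group, applied to synthetic samples
ps_model = LogisticRegression().fit(X, group)
ps_syn = ps_model.predict_proba(X_syn)[:, 1]

# sample synthetic data enriched for (or depleted of) sub-group characteristics
keep_group_like = X_syn[ps_syn > 0.5]
keep_rest = X_syn[ps_syn <= 0.5]
print(keep_group_like.shape, keep_rest.shape)
```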
Major mistakes or errors in the use of trial sequential analysis in systematic reviews or meta-analyses – the METSA systematic review
Authors: Christian Gunge Riberholt, Markus Harboe Olsen, Joachim Birch Milan, Sigurlaug Hanna Hafliðadóttir, Jeppe Houmann Svanholm, Elisabeth Buck Pedersen, Charles Chin Han Lew, Mark Aninakwah Asante, Johanne Pereira Ribeiro, Vibeke Wagner, Buddheera W. M. B. Kumburegama, Zheng-Yii Lee, Julie Perrine Schaug, Christina Madsen, Christian Gluud
DOI: 10.1186/s12874-024-02318-y | BMC Medical Research Methodology | Published 2024-09-09
Abstract: Systematic reviews and data synthesis of randomised clinical trials play a crucial role in clinical practice, research, and health policy. Trial sequential analysis can be used in systematic reviews to control type I and type II errors, but methodological errors, including lack of protocols and transparency, are cause for concern. We assessed the reporting of trial sequential analysis. We searched Medline and the Cochrane Database of Systematic Reviews from 1 January 2018 to 31 December 2021 for systematic reviews and meta-analysis reports that include a trial sequential analysis. Only studies with at least two randomised clinical trials analysed in a forest plot and a trial sequential analysis were included. Two independent investigators assessed the studies. We evaluated protocolisation, reporting, and interpretation of the analyses, including their effect on any GRADE evaluation of imprecision. We included 270 systematic reviews and 274 meta-analysis reports and extracted data from 624 trial sequential analyses. Only 134/270 (50%) systematic reviews planned the trial sequential analysis in the protocol. For analyses of dichotomous outcomes, the proportion of events in the control group was missing in 181/439 (41%), the relative risk reduction in 105/439 (24%), alpha in 30/439 (7%), beta in 128/439 (29%), and heterogeneity in 232/439 (53%). For analyses of continuous outcomes, the minimally relevant difference was missing in 125/185 (68%), the variance (or standard deviation) in 144/185 (78%), alpha in 23/185 (12%), beta in 63/185 (34%), and heterogeneity in 105/185 (57%). A graphical illustration of the trial sequential analysis was present in 93% of the analyses; however, the Z-curve was wrongly displayed in 135/624 (22%), and 227/624 (36%) did not include futility boundaries. The overall transparency of all 624 analyses was very poor in 236 (38%) and poor in 173 (28%). The majority of trial sequential analyses are not transparent when preparing or presenting the required parameters, partly due to missing or poorly conducted protocols. This hampers interpretation, reproducibility, and validity. Registration: PROSPERO CRD42021273811.
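For readers unfamiliar with the parameters the review checks, the sketch below shows, under stated assumptions, how a control event proportion, relative risk reduction, alpha, beta, and a heterogeneity (diversity) adjustment enter a required-information-size calculation, together with a rough O'Brien-Fleming-type boundary. This is an illustration only: the actual TSA software derives boundaries numerically from an alpha-spending function, and the heterogeneity inflation shown here is a simplification.

```python
# Illustrative sketch only (not the TSA software).
import numpy as np
from scipy.stats import norm

def required_information_size(p_control, rrr, alpha=0.05, beta=0.10, diversity=0.0):
    """Two-group information size for a dichotomous outcome, inflated for heterogeneity.

    The inflation 1 / (1 - diversity) mimics the diversity (D^2) adjustment described
    for trial sequential analysis; treat the exact adjustment as an assumption here.
    """
    p_experimental = p_control * (1 - rrr)
    p_bar = (p_control + p_experimental) / 2
    z_a, z_b = norm.ppf(1 - alpha / 2), norm.ppf(1 - beta)
    n = 4 * (z_a + z_b) ** 2 * p_bar * (1 - p_bar) / (p_control - p_experimental) ** 2
    return n / (1 - diversity)

def obrien_fleming_like_boundary(info_fraction, alpha=0.05):
    """Rough O'Brien-Fleming-type boundary z_(1-alpha/2) / sqrt(t); real TSA software
    derives monitoring boundaries numerically from an alpha-spending function."""
    return norm.ppf(1 - alpha / 2) / np.sqrt(info_fraction)

ris = required_information_size(p_control=0.30, rrr=0.20, diversity=0.25)
for t in (0.25, 0.5, 0.75, 1.0):
    print(f"info fraction {t:.2f}: boundary z = {obrien_fleming_like_boundary(t):.2f}"
          f" (of {ris:.0f} participants)")
```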
unmconf: an R package for Bayesian regression with unmeasured confounders
Authors: Ryan Hebdon, James Stamey, David Kahle, Xiang Zhang
DOI: 10.1186/s12874-024-02322-2 | BMC Medical Research Methodology | Published 2024-09-07 | PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11380322/pdf/
Abstract: The inability to correctly account for unmeasured confounding can lead to bias in parameter estimates, invalid uncertainty assessments, and erroneous conclusions. Sensitivity analysis is an approach to investigate the impact of unmeasured confounding in observational studies. However, the adoption of this approach has been slow, given the lack of accessible software. An extensive review of available R packages that account for unmeasured confounding lists deterministic sensitivity analysis methods, but no R packages were listed for probabilistic sensitivity analysis. The R package unmconf is the first available package for probabilistic sensitivity analysis through a Bayesian unmeasured confounding model. The package allows for normal, binary, Poisson, or gamma responses, accounting for one or two unmeasured confounders from the normal or binomial distribution. The goal of unmconf is to provide a user-friendly package that performs Bayesian modeling in the presence of unmeasured confounders, with simple commands on the front end while performing more intensive computation on the back end. We investigate the applicability of this package through novel simulation studies. The results indicate that credible intervals will have near-nominal coverage probability and smaller bias when modeling the unmeasured confounder(s) for varying levels of internal/external validation data across various combinations of response-unmeasured confounder distributional families.
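Since unmconf is an R package, its interface is not reproduced here. The sketch below is a simplified Monte Carlo probabilistic bias analysis for a single unmeasured binary confounder, illustrating the general idea of probabilistic (rather than deterministic) sensitivity analysis; the priors and the bias formula are illustrative assumptions, not the package's Bayesian model.

```python
# Minimal Monte Carlo sketch of probabilistic sensitivity analysis for one
# unmeasured binary confounder (simplified external adjustment, not unmconf).
import numpy as np

rng = np.random.default_rng(1)

rr_observed = 1.80          # observed exposure-outcome risk ratio (hypothetical)
n_draws = 10_000

# priors on bias parameters: confounder-outcome risk ratio and confounder
# prevalence among exposed / unexposed (all illustrative)
rr_cu = rng.lognormal(mean=np.log(1.8), sigma=0.2, size=n_draws)
p_u_exposed = rng.beta(6, 14, size=n_draws)     # roughly 0.30
p_u_unexposed = rng.beta(3, 17, size=n_draws)   # roughly 0.15

# classical bias factor for an unmeasured binary confounder on the risk-ratio scale
bias = (p_u_exposed * (rr_cu - 1) + 1) / (p_u_unexposed * (rr_cu - 1) + 1)
rr_adjusted = rr_observed / bias

lo, med, hi = np.percentile(rr_adjusted, [2.5, 50, 97.5])
print(f"confounding-adjusted RR: {med:.2f} (95% simulation interval {lo:.2f}-{hi:.2f})")
```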
Handling missing data and measurement error for early-onset myopia risk prediction models
Authors: Hongyu Lai, Kaiye Gao, Meiyan Li, Tao Li, Xiaodong Zhou, Xingtao Zhou, Hui Guo, Bo Fu
DOI: 10.1186/s12874-024-02319-x | BMC Medical Research Methodology | Published 2024-09-06 | PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11378546/pdf/
Background: Early identification of children at high risk of developing myopia is essential to prevent myopia progression by introducing timely interventions. However, missing data and measurement error (ME) are common challenges in risk prediction modelling that can introduce bias in myopia prediction.
Methods: We explore four imputation methods to address missing data and ME: single imputation (SI), multiple imputation under missing at random (MI-MAR), multiple imputation with a calibration procedure (MI-ME), and multiple imputation under missing not at random (MI-MNAR). We compare four machine-learning models (decision tree, naive Bayes, random forest, and XGBoost) and three statistical models (logistic regression, stepwise logistic regression, and least absolute shrinkage and selection operator logistic regression) in myopia risk prediction. We apply these models to the Shanghai Jinshan Myopia Cohort Study and also conduct a simulation study to investigate the impact of missing mechanisms, the degree of ME, and the importance of predictors on model performance. Model performance is evaluated using the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC).
Results: Our findings indicate that in scenarios with missing data and ME, using MI-ME in combination with logistic regression yields the best prediction results. In scenarios without ME, employing MI-MAR to handle missing data outperforms SI regardless of the missing mechanism. When ME has a greater impact on prediction than missing data, the relative advantage of MI-MAR diminishes and MI-ME becomes superior. Furthermore, our results demonstrate that statistical models exhibit better prediction performance than machine-learning models.
Conclusion: MI-ME emerges as a reliable method for handling missing data and ME in important predictors for early-onset myopia risk prediction.
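As a rough illustration of the MI-MAR baseline the paper compares against, the sketch below imputes missing predictors several times and averages the AUROC of a logistic regression over the imputed datasets. The measurement-error calibration step of MI-ME is not reproduced, and the data and variable names are synthetic placeholders.

```python
# Sketch of an MI-MAR-style workflow: multiple imputation, prediction, averaged AUROC.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
n = 1500
X = rng.normal(size=(n, 4))                      # e.g. refraction, axial length, age, outdoor time
y = rng.binomial(1, 1 / (1 + np.exp(-(X @ [1.2, 0.8, -0.5, 0.3]))))
X[rng.random(X.shape) < 0.2] = np.nan            # 20% of values set missing at random

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

aucs = []
for m in range(5):                               # m imputed datasets
    imputer = IterativeImputer(sample_posterior=True, random_state=m)
    X_tr_imp = imputer.fit_transform(X_tr)
    X_te_imp = imputer.transform(X_te)
    model = LogisticRegression(max_iter=1000).fit(X_tr_imp, y_tr)
    aucs.append(roc_auc_score(y_te, model.predict_proba(X_te_imp)[:, 1]))

print(f"AUROC averaged over imputations: {np.mean(aucs):.3f}")
```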
Gaps in the usage and reporting of multiple imputation for incomplete data: findings from a scoping review of observational studies addressing causal questions
Authors: Rheanna M Mainzer, Margarita Moreno-Betancur, Cattram D Nguyen, Julie A Simpson, John B Carlin, Katherine J Lee
DOI: 10.1186/s12874-024-02302-6 | BMC Medical Research Methodology | Published 2024-09-04 | PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11373423/pdf/
Background: Missing data are common in observational studies and often occur in several of the variables required when estimating a causal effect, i.e. the exposure, outcome and/or variables used to control for confounding. Analyses involving multiple incomplete variables are not as straightforward as analyses with a single incomplete variable. For example, in the context of multivariable missingness, the standard missing data assumptions ("missing completely at random", "missing at random" [MAR], "missing not at random") are difficult to interpret and assess. It is not clear how the complexities that arise due to multivariable missingness are being addressed in practice. The aim of this study was to review how missing data are managed and reported in observational studies that use multiple imputation (MI) for causal effect estimation, with a particular focus on missing data summaries, missing data assumptions, primary and sensitivity analyses, and MI implementation.
Methods: We searched five top general epidemiology journals for observational studies that aimed to answer a causal research question and used MI, published between January 2019 and December 2021. Article screening and data extraction were performed systematically.
Results: Of the 130 studies included in this review, 108 (83%) derived an analysis sample by excluding individuals with missing data in specific variables (e.g., the outcome) and 114 (88%) had multivariable missingness within the analysis sample. Forty-four (34%) studies provided a statement about missing data assumptions, 35 of which stated the MAR assumption, but only 11/44 (25%) studies provided a justification for these assumptions. The number of imputations, the MI method and the MI software were generally well reported (71%, 75% and 88% of studies, respectively), while aspects of the imputation model specification were not clear for more than half of the studies. A secondary analysis that used a different approach to handle the missing data was conducted in 69/130 (53%) studies. Of these 69 studies, 68 (99%) lacked a clear justification for the secondary analysis.
Conclusion: Effort is needed to clarify the rationale for and improve the reporting of MI for estimation of causal effects from observational data. We encourage greater transparency in making and reporting analytical decisions related to missing data.
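One reporting element touched on by the review, the pooling of results across imputations, can be made concrete with a minimal sketch of Rubin's rules; the estimates and variances below are hypothetical inputs from m analyses of imputed datasets.

```python
# Minimal sketch of Rubin's rules for pooling multiple-imputation results.
import numpy as np
from scipy import stats

def pool_rubin(estimates, variances):
    """Pool MI results: returns pooled estimate, total variance, and 95% CI."""
    estimates, variances = np.asarray(estimates), np.asarray(variances)
    m = len(estimates)
    q_bar = estimates.mean()                              # pooled point estimate
    w_bar = variances.mean()                              # within-imputation variance
    b = estimates.var(ddof=1)                             # between-imputation variance
    t = w_bar + (1 + 1 / m) * b                           # total variance
    df = (m - 1) * (1 + w_bar / ((1 + 1 / m) * b)) ** 2   # classical Rubin df (large-sample form)
    half = stats.t.ppf(0.975, df) * np.sqrt(t)
    return q_bar, t, (q_bar - half, q_bar + half)

# e.g. log-odds ratios and squared standard errors from m = 5 imputed analyses
est, var = [0.42, 0.37, 0.45, 0.40, 0.39], [0.012, 0.011, 0.013, 0.012, 0.012]
print(pool_rubin(est, var))
```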
Patient regional index: a new way to rank clinical specialties based on outpatient clinics big data
Authors: Xiaoling Peng, Moyuan Huang, Xinyang Li, Tianyi Zhou, Guiping Lin, Xiaoguang Wang
DOI: 10.1186/s12874-024-02309-z | BMC Medical Research Methodology | Published 2024-08-31 | PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11365139/pdf/
Background: Many existing healthcare ranking systems are notably intricate. The standards for peer review and evaluation often differ across specialties, leading to contradictory results among various ranking systems. There is a significant need for a comprehensible and consistent mode of specialty assessment.
Methods: This quantitative study aimed to assess the influence of clinical specialties on the regional distribution of patient origins based on 10,097,795 outpatient records of a large comprehensive hospital in South China. We proposed the patient regional index (PRI), a novel metric to quantify the regional influence of hospital specialties, using the principle of representative points of a statistical distribution. Additionally, a two-dimensional measure was constructed to gauge the significance of hospital specialties by integrating the PRI and outpatient volume.
Results: We calculated the PRI for each of the 16 specialties of interest over eight consecutive years. The longitudinal changes in the PRI accurately captured the impact of the 2017 Chinese healthcare reforms and the 2020 COVID-19 pandemic on hospital specialties. Finally, the two-dimensional assessment model we devised effectively illustrates the distinct characteristics across hospital specialties.
Conclusion: We propose a novel, straightforward, and interpretable index for quantifying the influence of hospital specialties. This index, built on outpatient data, requires only the patients' origin, thereby facilitating its widespread adoption and comparison across specialties of varying backgrounds. This data-driven method offers a patient-centric view of specialty influence, diverging from the traditional reliance on expert opinions. As such, it serves as a valuable augmentation to existing ranking systems.
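The paper builds the PRI from representative points of the patient-origin distribution; its exact construction is not reproduced here. The sketch below is a hypothetical proxy with the same inputs (one origin region per outpatient visit, grouped by specialty): the number of regions needed to cover 80% of a specialty's visits, which grows as a specialty draws patients from a wider area.

```python
# Hypothetical proxy for a regional-influence index (not the paper's PRI formula).
from collections import Counter

def regional_spread(origins, coverage=0.80):
    """origins: list of region codes, one per outpatient visit of a specialty."""
    counts = Counter(origins).most_common()
    total, cum, k = sum(c for _, c in counts), 0, 0
    for _, c in counts:
        cum += c
        k += 1
        if cum / total >= coverage:
            break
    return k

visits = {
    "ophthalmology": ["GD"] * 700 + ["GX"] * 150 + ["HN"] * 100 + ["FJ"] * 50,
    "dermatology":   ["GD"] * 950 + ["GX"] * 30 + ["HN"] * 20,
}
for specialty, origins in visits.items():
    print(specialty, "regions covering 80% of visits:", regional_spread(origins))
```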
Multi-metric comparison of machine learning imputation methods with application to breast cancer survival
Authors: Imad El Badisy, Nathalie Graffeo, Mohamed Khalis, Roch Giorgi
DOI: 10.1186/s12874-024-02305-3 | BMC Medical Research Methodology | Published 2024-08-30 | PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11363416/pdf/
Abstract: Handling missing data in clinical prognostic studies is an essential yet challenging task. This study aimed to provide a comprehensive assessment of the effectiveness and reliability of different machine learning (ML) imputation methods across various analytical perspectives. Specifically, it focused on three distinct classes of performance metrics used to evaluate ML imputation methods: post-imputation bias of regression estimates, post-imputation predictive accuracy, and substantive model-free metrics. As an illustration, we applied data from a real-world breast cancer survival study. A simulated dataset with 30% missing at random (MAR) values was used. A number of single imputation (SI) methods (KNN, missMDA, CART, missForest, missRanger, and missCforest) and multiple imputation (MI) methods (miceCART and miceRF) were evaluated. The performance metrics used were Gower's distance, estimation bias, empirical standard error, coverage rate, length of confidence interval, predictive accuracy, proportion of falsely classified (PFC), normalized root mean squared error (NRMSE), AUC, and C-index scores. The analysis revealed that in terms of Gower's distance, CART and missForest were the most accurate overall, missMDA and CART excelled for binary covariates, and missForest and miceCART were superior for continuous covariates. When assessing bias and accuracy in regression estimates, miceCART and miceRF exhibited the least bias. Overall, the various imputation methods demonstrated greater efficiency than complete-case analysis (CCA), with the MICE methods providing optimal confidence interval coverage. In terms of predictive accuracy for Cox models, missMDA and missForest had superior AUC and C-index scores. Despite offering better predictive accuracy, SI methods introduced more bias into the regression coefficients than MI methods. This study underlines the importance of selecting appropriate imputation methods based on study goals and data types in time-to-event research. The varying effectiveness of methods across the different performance metrics studied highlights the value of using advanced machine learning algorithms within a multiple imputation framework to enhance research integrity and the robustness of findings.
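Gower's distance, one of the substantive-model-free metrics listed above, compares imputed against original values for mixed-type data: range-normalised absolute differences for numeric variables and simple mismatch for categorical ones. A minimal sketch follows; the data and variable names are illustrative.

```python
# Minimal Gower dissimilarity between original and imputed values (mixed data).
import numpy as np
import pandas as pd

def gower_distance(original: pd.DataFrame, imputed: pd.DataFrame) -> float:
    """Mean Gower dissimilarity between corresponding rows of two data frames."""
    per_variable = []
    for col in original.columns:
        x, y = original[col], imputed[col]
        if pd.api.types.is_numeric_dtype(x):
            value_range = x.max() - x.min()
            d = (x - y).abs() / value_range if value_range > 0 else pd.Series(0.0, index=x.index)
        else:
            d = (x != y).astype(float)
        per_variable.append(d)
    return float(pd.concat(per_variable, axis=1).mean(axis=1).mean())

original = pd.DataFrame({"age": [54, 61, 47], "grade": ["II", "III", "I"]})
imputed = pd.DataFrame({"age": [56, 61, 45], "grade": ["II", "II", "I"]})
print(f"Gower distance: {gower_distance(original, imputed):.3f}")
```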
Distributed non-disclosive validation of predictive models by a modified ROC-GLM
Authors: Daniel Schalk, Raphael Rehms, Verena S Hoffmann, Bernd Bischl, Ulrich Mansmann
DOI: 10.1186/s12874-024-02312-4 | BMC Medical Research Methodology | Published 2024-08-29 | PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11363434/pdf/
Background: Distributed statistical analyses provide a promising approach for privacy protection when analyzing data distributed over several databases. Instead of directly operating on data, the analyst receives anonymous summary statistics, which are combined into an aggregated result. Further, in discrimination model (prognosis, diagnosis, etc.) development, it is key to evaluate a trained model with respect to its prognostic or predictive performance on new independent data. For binary classification, discrimination is quantified by the receiver operating characteristic (ROC) curve and its area under the curve (AUC) as an aggregation measure. We are interested in calculating both, as well as basic indicators of calibration-in-the-large, for a binary classification task using a distributed and privacy-preserving approach.
Methods: We employ DataSHIELD as the technology to carry out distributed analyses, and we use a newly developed algorithm to validate the prediction score by conducting distributed and privacy-preserving ROC analysis. Calibration curves are constructed from mean values over sites. The determination of the ROC curve and its AUC is based on a generalized linear model (GLM) approximation of the true ROC curve, the ROC-GLM, as well as on ideas of differential privacy (DP). DP adds noise (quantified by the ℓ₂ sensitivity Δ₂(f̂)) to the data and enables a global handling of placement numbers. The impact of the DP parameters was studied by simulations.
Results: In our simulation scenario, the true and distributed AUC measures differ by ΔAUC < 0.01, depending heavily on the choice of the differential privacy parameters. It is recommended to check the accuracy of the distributed AUC estimator in specific simulation scenarios along with a reasonable choice of DP parameters. Here, the accuracy of the distributed AUC estimator may be impaired by too much artificial noise added from DP.
Conclusions: The applicability of our algorithms depends on the ℓ₂ sensitivity Δ₂(f̂) of the underlying statistical/predictive model. The simulations carried out have shown that the approximation error is acceptable for the majority of simulated cases. For models with high Δ₂(f̂), the privacy parameters must be set accordingly higher to ensure sufficient privacy protection, which affects the approximation error. This work shows that complex measures, as the AUC …
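The role of the ℓ₂ sensitivity can be illustrated with the standard Gaussian mechanism of differential privacy. This is a generic sketch, not DataSHIELD's or the authors' implementation: the noise scale grows with Δ₂ and shrinks as the privacy budget (ε, δ) is relaxed, which mirrors the trade-off between privacy protection and approximation error described in the abstract.

```python
# Generic Gaussian mechanism: noise calibrated to the l2 sensitivity of the shared statistic.
import numpy as np

def gaussian_mechanism(value, l2_sensitivity, epsilon, delta, rng=None):
    """Release value + Gaussian noise with sigma = Delta_2 * sqrt(2 ln(1.25/delta)) / epsilon."""
    rng = rng or np.random.default_rng()
    sigma = l2_sensitivity * np.sqrt(2 * np.log(1.25 / delta)) / epsilon
    return value + rng.normal(scale=sigma, size=np.shape(value))

rng = np.random.default_rng(3)
placement_counts = np.array([12.0, 30.0, 41.0, 17.0])   # hypothetical per-site summary to share
for eps in (0.5, 1.0, 5.0):
    noisy = gaussian_mechanism(placement_counts, l2_sensitivity=1.0, epsilon=eps,
                               delta=1e-5, rng=rng)
    print(f"epsilon={eps}: {np.round(noisy, 1)}")
```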
A non-parametric approach to predict the recruitment for randomized clinical trials: an example in elderly inpatient settings
Authors: Alejandro Villasante-Tezanos, Yong-Fang Kuo, Christopher Kurinec, Yisheng Li, Xiaoying Yu
DOI: 10.1186/s12874-024-02314-2 | BMC Medical Research Methodology | Published 2024-08-29 | PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11363376/pdf/
Background: Accurate prediction of subject recruitment, which is critical to the success of a study, remains an ongoing challenge. Previous prediction models often rely on parametric assumptions which are not always met or may be difficult to implement. We aim to develop a novel method that is less sensitive to model assumptions and relatively easy to implement.
Methods: We create a weighted resampling-based approach to predict enrollment in year two based on recruitment data from year one of the completed GRIPS and PACE clinical trials. Different weight functions accounted for a range of potential enrollment trajectory patterns. Prediction accuracy was measured by the Euclidean distance for the enrollment sequence in year two, total enrollment over time, and total weeks to enroll a fixed number of subjects, against the actual year-two enrollment data. We compare the performance of the proposed method with an existing Bayesian method.
Results: Weighted resampling using the GRIPS data resulted in closer prediction, evidenced by better coverage of observed enrollment by the prediction intervals and smaller Euclidean distance from actual enrollment in year 2, especially when enrollment gaps were filled prior to the weighted resampling. These scenarios also produced more accurate predictions for total enrollment and the number of weeks to enroll 50 participants, and they outperformed an existing Bayesian method on all three accuracy measures. In the PACE data, using a reduced year-1 enrollment resulted in closer prediction, evidenced by better coverage of observed enrollment by the prediction intervals and smaller Euclidean distance from actual enrollment in year 2, with the weighted resampling scenarios better reflecting the seasonal variation seen in year 1. The reduced enrollment scenarios resulted in closer prediction for total enrollment over 6 and 12 months into year 2 and also outperformed an existing Bayesian method for the relevant accuracy measures.
Conclusion: The results demonstrate the feasibility and flexibility of a resampling-based, non-parametric approach for prediction of clinical trial recruitment with limited early enrollment data. Application to a wider setting and long-term prediction accuracy require further investigation.
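The weighted-resampling idea can be sketched in a few lines: resample year-1 weekly enrollment counts with a weight function (here a simple linear up-weighting of later weeks, chosen for illustration rather than taken from the paper) to simulate year-2 trajectories, then summarise the total and the time to a recruitment target with percentile prediction intervals.

```python
# Illustrative weighted resampling of weekly enrollment counts (hypothetical data).
import numpy as np

rng = np.random.default_rng(4)
year1_weekly = np.array([1, 0, 2, 3, 1, 2, 4, 2, 3, 1, 2, 3, 2, 4, 3, 2,
                         1, 3, 2, 2, 4, 3, 2, 3, 1, 2, 3, 4, 2, 3, 2, 3,
                         3, 2, 4, 3, 2, 3, 4, 2, 3, 3, 2, 4, 3, 3, 2, 4,
                         3, 2, 3, 4])                    # 52 observed weeks

weights = np.linspace(0.5, 1.5, len(year1_weekly))       # e.g. emphasise later weeks
weights = weights / weights.sum()

n_sims, horizon = 5000, 52
sims = rng.choice(year1_weekly, size=(n_sims, horizon), replace=True, p=weights)
cumulative = sims.cumsum(axis=1)

total_lo, total_med, total_hi = np.percentile(cumulative[:, -1], [2.5, 50, 97.5])
weeks_to_50 = (cumulative >= 50).argmax(axis=1) + 1      # first week reaching 50 (all paths reach it here)
print(f"projected year-2 total: {total_med:.0f} (95% PI {total_lo:.0f}-{total_hi:.0f})")
print(f"median weeks to enroll 50 participants: {np.median(weeks_to_50):.0f}")
```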
Identify the most appropriate imputation method for handling missing values in clinical structured datasets: a systematic review
Authors: Marziyeh Afkanpour, Elham Hosseinzadeh, Hamed Tabesh
DOI: 10.1186/s12874-024-02310-6 | BMC Medical Research Methodology | Published 2024-08-28 | PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11351057/pdf/
Background and objectives: Comprehending the research dataset is crucial for obtaining reliable and valid outcomes. Health analysts must have a deep comprehension of the data being analyzed. This comprehension allows them to suggest practical solutions for handling missing data in a clinical data source. Accurate handling of missing values is critical for producing precise estimates and making informed decisions, especially in crucial areas like clinical research. With data's increasing diversity and complexity, numerous scholars have developed a range of imputation techniques. To address this, we conducted a systematic review to introduce various imputation techniques based on tabular dataset characteristics, including the mechanism, pattern, and ratio of missingness, in order to identify the most appropriate imputation methods in the healthcare field.
Materials and methods: We searched four information databases, namely PubMed, Web of Science, Scopus, and IEEE Xplore, for articles published up to September 20, 2023, that discussed imputation methods for addressing missing values in a clinically structured dataset. Our investigation of the selected articles focused on four key aspects: the mechanism, pattern, and ratio of missingness, and the imputation strategies used. By synthesizing insights from these perspectives, we constructed an evidence map to recommend suitable imputation methods for handling missing values in a tabular dataset.
Results: Out of 2955 articles, 58 were included in the analysis. The evidence map, based on the structure of the missing values and the types of imputation methods used in the included studies, revealed that 45% of the studies employed conventional statistical methods, 31% utilized machine learning and deep learning methods, and 24% applied hybrid imputation techniques for handling missing values.
Conclusion: Considering the structure and characteristics of missing values in a clinical dataset is essential for choosing the most appropriate data imputation technique, especially within conventional statistical methods. Accurately estimating missing values to reflect reality enhances the likelihood of obtaining high-quality and reusable data, contributing significantly to precise medical decision-making processes. This review provides a guideline for choosing the most appropriate imputation methods in the data preprocessing stage before performing analytical processes on structured clinical datasets.
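Because the evidence map is organised around the mechanism, pattern, and ratio of missingness, a short sketch of how these three dataset characteristics can be inspected in a tabular clinical dataset may be useful; the data and the MAR-like dependence injected below are purely illustrative.

```python
# Inspecting the ratio, pattern, and a crude mechanism check for missing values.
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
df = pd.DataFrame({
    "age": rng.normal(60, 10, 500),
    "creatinine": rng.normal(1.0, 0.3, 500),
    "hba1c": rng.normal(6.5, 1.0, 500),
})
# make hba1c more often missing for older patients (a MAR-like dependence)
df.loc[rng.random(500) < (df["age"] - 40) / 60, "hba1c"] = np.nan
df.loc[rng.random(500) < 0.05, "creatinine"] = np.nan

# ratio of missingness per variable
print(df.isna().mean().round(3))

# missingness pattern: frequency of each observed/missing combination
print(df.isna().value_counts().head())

# mechanism check: do observed ages differ by whether hba1c is missing?
print(df.groupby(df["hba1c"].isna())["age"].mean().round(1))
```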