{"title":"Penalized Regression Methods With Modified Cross-Validation and Bootstrap Tuning Produce Better Prediction Models","authors":"Menelaos Pavlou, Rumana Z. Omar, Gareth Ambler","doi":"10.1002/bimj.202300245","DOIUrl":"10.1002/bimj.202300245","url":null,"abstract":"<p>Risk prediction models fitted using maximum likelihood estimation (MLE) are often overfitted resulting in predictions that are too extreme and a calibration slope (CS) less than 1. Penalized methods, such as Ridge and Lasso, have been suggested as a solution to this problem as they tend to shrink regression coefficients toward zero, resulting in predictions closer to the average. The amount of shrinkage is regulated by a tuning parameter, λ, commonly selected via cross-validation (“standard tuning”). Though penalized methods have been found to improve calibration on average, they often over-shrink and exhibit large variability in the selected λ and hence the CS. This is a problem, particularly for small sample sizes, but also when using sample sizes recommended to control overfitting. We consider whether these problems are partly due to selecting λ using cross-validation with “training” datasets of reduced size compared to the original development sample, resulting in an over-estimation of λ and, hence, excessive shrinkage. 
We propose a modified cross-validation tuning method (“modified tuning”), which estimates λ from a pseudo-development dataset obtained via bootstrapping from the original dataset, albeit of larger size, such that the resulting cross-validation training datasets are of the same size as the original dataset. Modified tuning can be easily implemented in standard software and is closely related to bootstrap selection of the tuning parameter (“bootstrap tuning”). We evaluated modified and bootstrap tuning for Ridge and Lasso in simulated and real data using recommended sample sizes, and sizes slightly lower and higher. They substantially improved the selection of λ, resulting in improved CS compared to the standard tuning method. They also improved predictions compared to MLE.</p>","PeriodicalId":55360,"journal":{"name":"Biometrical Journal","volume":null,"pages":null},"PeriodicalIF":1.3,"publicationDate":"2024-06-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/bimj.202300245","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141460887","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Causal inference in the absence of positivity: The role of overlap weights","authors":"Roland A. Matsouaka, Yunji Zhou","doi":"10.1002/bimj.202300156","DOIUrl":"10.1002/bimj.202300156","url":null,"abstract":"<p>How should data be analyzed when the positivity assumption is violated? Several possible solutions exist in the literature. In this paper, we consider propensity score (PS) methods that are commonly used in observational studies to assess causal treatment effects in the context where the positivity assumption is violated. We focus on and examine four specific alternatives to inverse probability weighting (IPW) trimming and truncation: matching weight (MW), Shannon's entropy weight (EW), overlap weight (OW), and beta weight (BW) estimators.</p><p>We first specify their target population, the population of patients for whom there is clinical equipoise, that is, where we have sufficient PS overlap. Then, we establish the nexus among the different corresponding weights (and estimators); this allows us to highlight the shared properties and theoretical implications of these estimators. Finally, we introduce their augmented estimators that take advantage of estimating both the propensity score and outcome regression models to enhance the treatment effect estimators in terms of bias and efficiency. We also elucidate the role of the OW estimator as the flagship of all these methods that target the overlap population.</p><p>Our analytic results demonstrate that OW, MW, and EW are preferable to IPW and some cases of BW when there is a moderate or extreme (stochastic or structural) violation of the positivity assumption. We then evaluate, compare, and confirm the finite-sample performance of the aforementioned estimators via Monte Carlo simulations. 
Finally, we illustrate these methods using two real-world data examples marked by violations of the positivity assumption.</p>","PeriodicalId":55360,"journal":{"name":"Biometrical Journal","volume":null,"pages":null},"PeriodicalIF":1.7,"publicationDate":"2024-06-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141285482","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Adaptive predictor-set linear model: An imputation-free method for linear regression prediction on data sets with missing values","authors":"Benjamin Planterose Jiménez, Manfred Kayser, Athina Vidaki, Amke Caliebe","doi":"10.1002/bimj.202300090","DOIUrl":"10.1002/bimj.202300090","url":null,"abstract":"<p>Linear regression (LR) is widely used in data analysis for continuous outcomes in biomedicine and epidemiology. Despite its popularity, LR is incompatible with missing data, which frequently occur in health sciences. For parameter estimation, this shortcoming is usually resolved by complete-case analysis or imputation. Both work-arounds, however, are inadequate for prediction, since they either fail to predict on incomplete records or ignore missingness-induced reduction in prediction accuracy and rely on (unrealistic) assumptions about the missingness mechanism. Here, we derive adaptive predictor-set linear model (aps-lm), capable of making predictions for incomplete data without the need for imputation. It is derived by using a predictor-selection operation, the Moore–Penrose pseudoinverse, and the reduced QR decomposition. aps-lm is an LR generalization that inherently handles missing values. It is applied on a reference data set, where complete predictors and outcome are available, and yields a set of privacy-preserving parameters. In a second stage, these are shared for making predictions of the outcome on external data sets with missing entries for predictors without imputation. Moreover, aps-lm computes prediction errors that account for the pattern of missing values even under extreme missingness. We benchmark aps-lm in a simulation study. aps-lm showed greater prediction accuracy and reduced bias compared to popular imputation strategies under a wide range of scenarios including variation of sample size, goodness of fit, missing value type, and covariance structure. 
Finally, as a proof-of-principle, we apply aps-lm in the context of epigenetic aging clocks, linear models that predict a person's biological age from epigenetic data with promising clinical applications.</p>","PeriodicalId":55360,"journal":{"name":"Biometrical Journal","volume":null,"pages":null},"PeriodicalIF":1.7,"publicationDate":"2024-05-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/bimj.202300090","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141177096","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Bayesian hierarchical hidden Markov model for clustering and gene selection: Application to kidney cancer gene expression data","authors":"Thierry Chekouo, Himadri Mukherjee","doi":"10.1002/bimj.202300173","DOIUrl":"10.1002/bimj.202300173","url":null,"abstract":"<p>We introduce a Bayesian approach for biclustering that accounts for the prior functional dependence between genes using hidden Markov models (HMMs). We utilize biological knowledge gathered from gene ontologies and the hidden Markov structure to capture the potential coexpression of neighboring genes. Our interpretable model-based clustering characterized each cluster of samples by three groups of features: overexpressed, underexpressed, and irrelevant features. The proposed methods have been implemented in an R package and are used to analyze both the simulated data and The Cancer Genome Atlas kidney cancer data.</p>","PeriodicalId":55360,"journal":{"name":"Biometrical Journal","volume":null,"pages":null},"PeriodicalIF":1.7,"publicationDate":"2024-05-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/bimj.202300173","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141181467","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Valid instrumental variable selection method using negative control outcomes and constructing efficient estimator","authors":"Shunichiro Orihara, Atsushi Goto, Masataka Taguri","doi":"10.1002/bimj.202300113","DOIUrl":"10.1002/bimj.202300113","url":null,"abstract":"<p>In observational studies, instrumental variable (IV) methods are commonly applied when there are unmeasured covariates. In Mendelian randomization, an allele score is often constructed from many single nucleotide polymorphisms; however, including some invalid IVs risks biasing the estimated causal effects. Invalid IVs are those IV candidates that are associated with unobserved variables. To solve this problem, we developed a novel strategy using negative control outcomes (NCOs) as auxiliary variables. Using NCOs, we are able to select only valid IVs and exclude invalid IVs without knowing which of the instruments are invalid. We also developed a new two-step estimation procedure and proved the semiparametric efficiency of our estimator. In simulations, the proposed method outperformed some previous methods. Subsequently, we applied the proposed method to the UK Biobank dataset. Our results demonstrate that the use of an auxiliary variable, such as an NCO, enables the selection of valid IVs with assumptions different from those used in previous methods.</p>","PeriodicalId":55360,"journal":{"name":"Biometrical Journal","volume":null,"pages":null},"PeriodicalIF":1.7,"publicationDate":"2024-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141155844","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Predicting class switch recombination in B-cells from antibody repertoire data","authors":"Lutecia Servius, Davide Pigoli, Joseph Ng, Franca Fraternali","doi":"10.1002/bimj.202300171","DOIUrl":"10.1002/bimj.202300171","url":null,"abstract":"<p>Statistical and machine learning methods have proved useful in many areas of immunology. In this paper, we address for the first time the problem of predicting the occurrence of class switch recombination (CSR) in B-cells, a problem of interest in understanding antibody response under immunological challenges. We propose a framework to analyze antibody repertoire data, based on clonal group (CG) representation in a way that allows us to predict CSR events using CG level features as input. We assess and compare the performance of several predicting models (logistic regression, LASSO logistic regression, random forest, and support vector machine) in carrying out this task. The proposed approach can obtain an unweighted average recall of 71% with models based on variable region descriptors and measures of CG diversity during an immune challenge and, most notably, before an immune challenge.</p>","PeriodicalId":55360,"journal":{"name":"Biometrical Journal","volume":null,"pages":null},"PeriodicalIF":1.7,"publicationDate":"2024-05-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/bimj.202300171","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141089535","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A nonparametric proportional risk model to assess a treatment effect in time-to-event data","authors":"Lucia Ameis, Oliver Kuss, Annika Hoyer, Kathrin Möllenhoff","doi":"10.1002/bimj.202300147","DOIUrl":"10.1002/bimj.202300147","url":null,"abstract":"<p>Time-to-event analysis often relies on prior parametric assumptions, or, if a semiparametric approach is chosen, Cox's model. This is inherently tied to the assumption of proportional hazards, with the analysis potentially invalidated if this assumption is not fulfilled. In addition, most interpretations focus on the hazard ratio, which is often misinterpreted as the relative risk (RR), the ratio of the cumulative distribution functions. In this paper, we introduce an alternative to current methodology for assessing a treatment effect in a two-group situation, not relying on the proportional hazards assumption but assuming proportional risks. Specifically, we propose a new nonparametric model to directly estimate the RR of two groups to experience an event under the assumption that the risk ratio is constant over time. In addition to this relative measure, our model allows for calculating the number needed to treat as an absolute measure, providing the possibility of an easy and holistic interpretation of the data. 
We demonstrate the validity of the approach by means of a simulation study and present an application to data from a large randomized controlled trial investigating the effect of dapagliflozin on all-cause mortality.</p>","PeriodicalId":55360,"journal":{"name":"Biometrical Journal","volume":null,"pages":null},"PeriodicalIF":1.7,"publicationDate":"2024-05-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/bimj.202300147","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141089524","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A method for determining groups in cumulative incidence curves in competing risk data","authors":"Marta Sestelo, Luís Meira-Machado, Nora M. Villanueva, Javier Roca-Pardiñas","doi":"10.1002/bimj.202300084","DOIUrl":"10.1002/bimj.202300084","url":null,"abstract":"<p>The cumulative incidence function is the standard method for estimating the marginal probability of a given event in the presence of competing risks. One basic but important goal in the analysis of competing risk data is the comparison of these curves, for which limited literature exists. We proposed a new procedure that lets us not only test the equality of these curves but also group them if they are not equal. The proposed method allows determining the composition of the groups as well as an automatic selection of their number. Simulation studies show the good numerical behavior of the proposed methods for finite sample size. The applicability of the proposed method is illustrated using real data.</p>","PeriodicalId":55360,"journal":{"name":"Biometrical Journal","volume":null,"pages":null},"PeriodicalIF":1.7,"publicationDate":"2024-05-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/bimj.202300084","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141077094","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Sparse Group Penalties for bi-level variable selection","authors":"Gregor Buch, Andreas Schulz, Irene Schmidtmann, Konstantin Strauch, Philipp S. Wild","doi":"10.1002/bimj.202200334","DOIUrl":"10.1002/bimj.202200334","url":null,"abstract":"<p>Many data sets exhibit a natural group structure due to contextual similarities or high correlations of variables, such as lipid markers that are interrelated based on biochemical principles. Knowledge of such groupings can be used through bi-level selection methods to identify relevant feature groups and highlight their predictive members. One of the best known approaches of this kind combines the classical <i>Least Absolute Shrinkage and Selection Operator</i> (LASSO) with the <i>Group LASSO</i>, resulting in the <i>Sparse Group LASSO</i> (SGL). We propose the Sparse Group Penalty (SGP) framework, which allows for a flexible combination of different SGL-style shrinkage conditions. Analogous to SGL, we investigated the combination of the <i>Smoothly Clipped Absolute Deviation</i> (SCAD), the <i>Minimax Concave Penalty</i> (MCP) and the <i>Exponential Penalty</i> (EP) with their group versions, resulting in the <i>Sparse Group SCAD</i>, the <i>Sparse Group MCP</i>, and the novel <i>Sparse Group EP</i> (SGE). Those shrinkage operators provide refined control of the effect of group formation on the selection process through a tuning parameter. In simulation studies, SGPs were compared with other bi-level selection methods (Group Bridge, composite MCP, and Group Exponential LASSO) for variable and group selection evaluated with the Matthews correlation coefficient. We demonstrated the advantages of the new SGE in identifying parsimonious models, but also identified scenarios that highlight the limitations of the approach. 
The performance of the techniques was further investigated in a real-world use case for the selection of regulated lipids in a randomized clinical trial.</p>","PeriodicalId":55360,"journal":{"name":"Biometrical Journal","volume":null,"pages":null},"PeriodicalIF":1.7,"publicationDate":"2024-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/bimj.202200334","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140924034","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}