Raphael O. Betschart, Cristian Riccio, Domingo Aguilera-Garcia, Stefan Blankenberg, Linlin Guo, Holger Moch, Dagmar Seidl, Hugo Solleder, Felix Thalén, Alexandre Thiéry, Raphael Twerenbold, Tanja Zeller, Martin Zoche, Andreas Ziegler
{"title":"Biostatistical Aspects of Whole Genome Sequencing Studies: Preprocessing and Quality Control","authors":"Raphael O. Betschart, Cristian Riccio, Domingo Aguilera-Garcia, Stefan Blankenberg, Linlin Guo, Holger Moch, Dagmar Seidl, Hugo Solleder, Felix Thalén, Alexandre Thiéry, Raphael Twerenbold, Tanja Zeller, Martin Zoche, Andreas Ziegler","doi":"10.1002/bimj.202300278","DOIUrl":"10.1002/bimj.202300278","url":null,"abstract":"<p>Rapid advances in high-throughput DNA sequencing technologies have enabled large-scale whole genome sequencing (WGS) studies. Before performing association analysis between phenotypes and genotypes, preprocessing and quality control (QC) of the raw sequence data need to be performed. Because many biostatisticians have not been working with WGS data so far, we first sketch Illumina's short-read sequencing technology. Second, we explain the general preprocessing pipeline for WGS studies. Third, we provide an overview of important QC metrics, which are applied to WGS data: on the raw data, after mapping and alignment, after variant calling, and after multisample variant calling. Fourth, we illustrate the QC with the data from the GENEtic SequencIng Study Hamburg–Davos (GENESIS-HD), a study involving more than 9000 human whole genomes. All samples were sequenced on an Illumina NovaSeq 6000 with an average coverage of 35× using a PCR-free protocol. For QC, one genome in a bottle (GIAB) trio was sequenced in four replicates, and one GIAB sample was successfully sequenced 70 times in different runs. Fifth, we provide empirical data on the compression of raw data using the DRAGEN original read archive (ORA). The most important quality metrics in the application were genetic similarity, sample cross-contamination, deviations from the expected Het/Hom ratio, relatedness, and coverage. The compression ratio of the raw files using DRAGEN ORA was 5.6:1, and compression time was linear by genome coverage. In summary, the preprocessing, joint calling, and QC of large WGS studies are feasible within a reasonable time, and efficient QC procedures are readily available.</p>","PeriodicalId":55360,"journal":{"name":"Biometrical Journal","volume":"66 5","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/bimj.202300278","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141581613","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Siyuan Guo, Jiajia Zhang, Yichao Wu, Alexander C. McLain, James W. Hardin, Bankole Olatosi, Xiaoming Li
{"title":"Functional Multivariable Logistic Regression With an Application to HIV Viral Suppression Prediction","authors":"Siyuan Guo, Jiajia Zhang, Yichao Wu, Alexander C. McLain, James W. Hardin, Bankole Olatosi, Xiaoming Li","doi":"10.1002/bimj.202300081","DOIUrl":"10.1002/bimj.202300081","url":null,"abstract":"<p>Motivated by improving the prediction of the human immunodeficiency virus (HIV) suppression status using electronic health records (EHR) data, we propose a functional multivariable logistic regression model, which accounts for the longitudinal binary process and continuous process simultaneously. Specifically, the longitudinal measurements for either binary or continuous variables are modeled by functional principal components analysis, and their corresponding functional principal component scores are used to build a logistic regression model for prediction. The longitudinal binary data are linked to underlying Gaussian processes. The estimation is done using penalized spline for the longitudinal continuous and binary data. Group-lasso is used to select longitudinal processes, and the multivariate functional principal components analysis is proposed to revise functional principal component scores with the correlation. The method is evaluated via comprehensive simulation studies and then applied to predict viral suppression using EHR data for people living with HIV in South Carolina.</p>","PeriodicalId":55360,"journal":{"name":"Biometrical Journal","volume":"66 5","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-07-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/bimj.202300081","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141536065","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Combining Partial True Discovery Guarantee Procedures","authors":"Ningning Xu, Aldo Solari, Jelle J. Goeman","doi":"10.1002/bimj.202300075","DOIUrl":"10.1002/bimj.202300075","url":null,"abstract":"<p>Closed testing has recently been shown to be optimal for simultaneous true discovery proportion control. It is, however, challenging to construct true discovery guarantee procedures in such a way that it focuses power on some feature sets chosen by users based on their specific interest or expertise. We propose a procedure that allows users to target power on prespecified feature sets, that is, “focus sets.” Still, the method also allows inference for feature sets chosen post hoc, that is, “nonfocus sets,” for which we deduce a true discovery lower confidence bound by interpolation. Our procedure is built from partial true discovery guarantee procedures combined with Holm's procedure and is a conservative shortcut to the closed testing procedure. A simulation study confirms that the statistical power of our method is relatively high for focus sets, at the cost of power for nonfocus sets, as desired. In addition, we investigate its power property for sets with specific structures, for example, trees and directed acyclic graphs. We also compare our method with AdaFilter in the context of replicability analysis. The application of our method is illustrated with a gene ontology analysis in gene expression data.</p>","PeriodicalId":55360,"journal":{"name":"Biometrical Journal","volume":"66 5","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/bimj.202300075","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141494375","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sören Budig, Klaus Jung, Mario Hasler, Frank Schaarschmidt
{"title":"Simultaneous Inference of Multiple Binary Endpoints in Biomedical Research: Small Sample Properties of Multiple Marginal Models and a Resampling Approach","authors":"Sören Budig, Klaus Jung, Mario Hasler, Frank Schaarschmidt","doi":"10.1002/bimj.202300197","DOIUrl":"10.1002/bimj.202300197","url":null,"abstract":"<p>In biomedical research, the simultaneous inference of multiple binary endpoints may be of interest. In such cases, an appropriate multiplicity adjustment is required that controls the family-wise error rate, which represents the probability of making incorrect test decisions. In this paper, we investigate two approaches that perform single-step <span></span><math>\u0000 <semantics>\u0000 <mi>p</mi>\u0000 <annotation>$p$</annotation>\u0000 </semantics></math>-value adjustments that also take into account the possible correlation between endpoints. A rather novel and flexible approach known as multiple marginal models is considered, which is based on stacking of the parameter estimates of the marginal models and deriving their joint asymptotic distribution. We also investigate a nonparametric vector-based resampling approach, and we compare both approaches with the Bonferroni method by examining the family-wise error rate and power for different parameter settings, including low proportions and small sample sizes. The results show that the resampling-based approach consistently outperforms the other methods in terms of power, while still controlling the family-wise error rate. The multiple marginal models approach, on the other hand, shows a more conservative behavior. However, it offers more versatility in application, allowing for more complex models or straightforward computation of simultaneous confidence intervals. The practical application of the methods is demonstrated using a toxicological dataset from the National Toxicology Program.</p>","PeriodicalId":55360,"journal":{"name":"Biometrical Journal","volume":"66 5","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/bimj.202300197","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141494376","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Penalized Regression Methods With Modified Cross-Validation and Bootstrap Tuning Produce Better Prediction Models","authors":"Menelaos Pavlou, Rumana Z. Omar, Gareth Ambler","doi":"10.1002/bimj.202300245","DOIUrl":"10.1002/bimj.202300245","url":null,"abstract":"<p>Risk prediction models fitted using maximum likelihood estimation (MLE) are often overfitted resulting in predictions that are too extreme and a calibration slope (CS) less than 1. Penalized methods, such as Ridge and Lasso, have been suggested as a solution to this problem as they tend to shrink regression coefficients toward zero, resulting in predictions closer to the average. The amount of shrinkage is regulated by a tuning parameter, <span></span><math>\u0000 <semantics>\u0000 <mrow>\u0000 <mi>λ</mi>\u0000 <mo>,</mo>\u0000 </mrow>\u0000 <annotation>$lambda ,$</annotation>\u0000 </semantics></math> commonly selected via cross-validation (“standard tuning”). Though penalized methods have been found to improve calibration on average, they often over-shrink and exhibit large variability in the selected <span></span><math>\u0000 <semantics>\u0000 <mi>λ</mi>\u0000 <annotation>$lambda $</annotation>\u0000 </semantics></math> and hence the CS. This is a problem, particularly for small sample sizes, but also when using sample sizes recommended to control overfitting. We consider whether these problems are partly due to selecting <span></span><math>\u0000 <semantics>\u0000 <mi>λ</mi>\u0000 <annotation>$lambda $</annotation>\u0000 </semantics></math> using cross-validation with “training” datasets of reduced size compared to the original development sample, resulting in an over-estimation of <span></span><math>\u0000 <semantics>\u0000 <mi>λ</mi>\u0000 <annotation>$lambda $</annotation>\u0000 </semantics></math> and, hence, excessive shrinkage. We propose a modified cross-validation tuning method (“modified tuning”), which estimates <span></span><math>\u0000 <semantics>\u0000 <mi>λ</mi>\u0000 <annotation>$lambda $</annotation>\u0000 </semantics></math> from a pseudo-development dataset obtained via bootstrapping from the original dataset, albeit of larger size, such that the resulting cross-validation training datasets are of the same size as the original dataset. Modified tuning can be easily implemented in standard software and is closely related to bootstrap selection of the tuning parameter (“bootstrap tuning”). We evaluated modified and bootstrap tuning for Ridge and Lasso in simulated and real data using recommended sample sizes, and sizes slightly lower and higher. They substantially improved the selection of <span></span><math>\u0000 <semantics>\u0000 <mi>λ</mi>\u0000 <annotation>$lambda $</annotation>\u0000 </semantics></math>, resulting in improved CS compared to the standard tuning method. They also improved predictions compared to MLE.</p>","PeriodicalId":55360,"journal":{"name":"Biometrical Journal","volume":"66 5","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-06-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/bimj.202300245","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141460887","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Causal inference in the absence of positivity: The role of overlap weights","authors":"Roland A. Matsouaka, Yunji Zhou","doi":"10.1002/bimj.202300156","DOIUrl":"10.1002/bimj.202300156","url":null,"abstract":"<p>How to analyze data when there is violation of the positivity assumption? Several possible solutions exist in the literature. In this paper, we consider propensity score (PS) methods that are commonly used in observational studies to assess causal treatment effects in the context where the positivity assumption is violated. We focus on and examine four specific alternative solutions to the inverse probability weighting (IPW) trimming and truncation: matching weight (MW), Shannon's entropy weight (EW), overlap weight (OW), and beta weight (BW) estimators.</p><p>We first specify their target population, the population of patients for whom clinical equipoise, that is, where we have sufficient PS overlap. Then, we establish the nexus among the different corresponding weights (and estimators); this allows us to highlight the shared properties and theoretical implications of these estimators. Finally, we introduce their augmented estimators that take advantage of estimating both the propensity score and outcome regression models to enhance the treatment effect estimators in terms of bias and efficiency. We also elucidate the role of the OW estimator as the flagship of all these methods that target the overlap population.</p><p>Our analytic results demonstrate that OW, MW, and EW are preferable to IPW and some cases of BW when there is a moderate or extreme (stochastic or structural) violation of the positivity assumption. We then evaluate, compare, and confirm the finite-sample performance of the aforementioned estimators via Monte Carlo simulations. Finally, we illustrate these methods using two real-world data examples marked by violations of the positivity assumption.</p>","PeriodicalId":55360,"journal":{"name":"Biometrical Journal","volume":"66 4","pages":""},"PeriodicalIF":1.7,"publicationDate":"2024-06-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141285482","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Benjamin Planterose Jiménez, Manfred Kayser, Athina Vidaki, Amke Caliebe
{"title":"Adaptive predictor-set linear model: An imputation-free method for linear regression prediction on data sets with missing values","authors":"Benjamin Planterose Jiménez, Manfred Kayser, Athina Vidaki, Amke Caliebe","doi":"10.1002/bimj.202300090","DOIUrl":"10.1002/bimj.202300090","url":null,"abstract":"<p>Linear regression (LR) is vastly used in data analysis for continuous outcomes in biomedicine and epidemiology. Despite its popularity, LR is incompatible with missing data, which frequently occur in health sciences. For parameter estimation, this shortcoming is usually resolved by complete-case analysis or imputation. Both work-arounds, however, are inadequate for prediction, since they either fail to predict on incomplete records or ignore missingness-induced reduction in prediction accuracy and rely on (unrealistic) assumptions about the missing mechanism. Here, we derive adaptive predictor-set linear model (aps-lm), capable of making predictions for incomplete data without the need for imputation. It is derived by using a predictor-selection operation, the Moore–Penrose pseudoinverse, and the reduced QR decomposition. aps-lm is an LR generalization that inherently handles missing values. It is applied on a reference data set, where complete predictors and outcome are available, and yields a set of privacy-preserving parameters. In a second stage, these are shared for making predictions of the outcome on external data sets with missing entries for predictors without imputation. Moreover, aps-lm computes prediction errors that account for the pattern of missing values even under extreme missingness. We benchmark aps-lm in a simulation study. aps-lm showed greater prediction accuracy and reduced bias compared to popular imputation strategies under a wide range of scenarios including variation of sample size, goodness of fit, missing value type, and covariance structure. Finally, as a proof-of-principle, we apply aps-lm in the context of epigenetic aging clocks, linear models that predict a person's biological age from epigenetic data with promising clinical applications.</p>","PeriodicalId":55360,"journal":{"name":"Biometrical Journal","volume":"66 4","pages":""},"PeriodicalIF":1.7,"publicationDate":"2024-05-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/bimj.202300090","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141177096","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Bayesian hierarchical hidden Markov model for clustering and gene selection: Application to kidney cancer gene expression data","authors":"Thierry Chekouo, Himadri Mukherjee","doi":"10.1002/bimj.202300173","DOIUrl":"10.1002/bimj.202300173","url":null,"abstract":"<p>We introduce a Bayesian approach for biclustering that accounts for the prior functional dependence between genes using hidden Markov models (HMMs). We utilize biological knowledge gathered from gene ontologies and the hidden Markov structure to capture the potential coexpression of neighboring genes. Our interpretable model-based clustering characterized each cluster of samples by three groups of features: overexpressed, underexpressed, and irrelevant features. The proposed methods have been implemented in an R package and are used to analyze both the simulated data and The Cancer Genome Atlas kidney cancer data.</p>","PeriodicalId":55360,"journal":{"name":"Biometrical Journal","volume":"66 4","pages":""},"PeriodicalIF":1.7,"publicationDate":"2024-05-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/bimj.202300173","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141181467","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Valid instrumental variable selection method using negative control outcomes and constructing efficient estimator","authors":"Shunichiro Orihara, Atsushi Goto, Masataka Taguri","doi":"10.1002/bimj.202300113","DOIUrl":"10.1002/bimj.202300113","url":null,"abstract":"<p>In observational studies, instrumental variable (IV) methods are commonly applied when there are unmeasured covariates. In Mendelian randomization, constructing an allele score using many single nucleotide polymorphisms is often implemented; however, estimating biased causal effects by including some invalid IVs poses some risks. Invalid IVs are those IV candidates that are associated with unobserved variables. To solve this problem, we developed a novel strategy using negative control outcomes (NCOs) as auxiliary variables. Using NCOs, we are able to select only valid IVs and exclude invalid IVs without knowing which of the instruments are invalid. We also developed a new two-step estimation procedure and proved the semiparametric efficiency of our estimator. The performance of our proposed method was superior to some previous methods through simulations. Subsequently, we applied the proposed method to the UK Biobank dataset. Our results demonstrate that the use of an auxiliary variable, such as an NCO, enables the selection of valid IVs with assumptions different from those used in previous methods.</p>","PeriodicalId":55360,"journal":{"name":"Biometrical Journal","volume":"66 4","pages":""},"PeriodicalIF":1.7,"publicationDate":"2024-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141155844","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}