T. Nengsih, F. Bertrand, M. Maumy-Bertrand, Nicolas Meyer
{"title":"Determining the number of components in PLS regression on incomplete data set","authors":"T. Nengsih, F. Bertrand, M. Maumy-Bertrand, Nicolas Meyer","doi":"10.1515/sagmb-2018-0059","DOIUrl":"https://doi.org/10.1515/sagmb-2018-0059","url":null,"abstract":"Partial least squares regression – or PLS regression – is a multivariate method in which the model parameters are estimated using either the SIMPLS or NIPALS algorithm. PLS regression has been extensively used in applied research because of its effectiveness in analyzing relationships between an outcome and one or several components. Note that the NIPALS algorithm can provide parameter estimates on incomplete data. The selection of the number of components used to build a representative model in PLS regression is a central issue. However, how to deal with missing data when using PLS regression remains a matter of debate. Several approaches have been proposed in the literature, including the Q2 criterion, and the AIC and BIC criteria. Here we study the behavior of the NIPALS algorithm when used to fit a PLS regression for various proportions of missing data and different types of missingness. We compare criteria to select the number of components for a PLS regression on an incomplete data set and on an imputed data set using three imputation methods: multiple imputation by chained equations, k-nearest neighbour imputation, and singular value decomposition imputation. We tested various criteria with different proportions of missing data (ranging from 5% to 50%) under different missingness assumptions. Q2-leave-one-out component selection methods gave more reliable results than AIC and BIC-based ones.","PeriodicalId":49477,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"18 1","pages":""},"PeriodicalIF":0.9,"publicationDate":"2018-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/sagmb-2018-0059","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46367347","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Tobias Madsen, Michal P. Switnicki, Malene Juul, J. S. Pedersen
{"title":"EBADIMEX: an empirical Bayes approach to detect joint differential expression and methylation and to classify samples","authors":"Tobias Madsen, Michal P. Switnicki, Malene Juul, J. S. Pedersen","doi":"10.1101/401232","DOIUrl":"https://doi.org/10.1101/401232","url":null,"abstract":"DNA methylation and gene expression are interdependent and both implicated in cancer development and progression, with many individual biomarkers discovered. A joint analysis of the two data types can potentially lead to biological insights that are not discoverable with separate analyses. To optimally leverage the joint data for identifying perturbed genes and classifying clinical cancer samples, it is important to accurately model the interactions between the two data types. Here, we present EBADIMEX for jointly identifying differential expression and methylation and classifying samples. The moderated t-test widely used with empirical Bayes priors in current differential expression methods is generalised to a multivariate setting by developing: (1) a moderated Welch t-test for equality of means with unequal variances; (2) a moderated F-test for equality of variances; and (3) a multivariate test for equality of means with equal variances. This leads to parametric models with prior distributions for the parameters, which allow fast evaluation and robust analysis of small data sets. EBADIMEX is demonstrated on simulated data as well as a large breast cancer (BRCA) cohort from TCGA. We show that the use of empirical Bayes priors and moderated tests works particularly well on small data sets.","PeriodicalId":49477,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":" ","pages":""},"PeriodicalIF":0.9,"publicationDate":"2018-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47425887","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Nele Cosemans, Peter Claes, Nathalie Brison, Joris Robert Vermeesch, Hilde Peeters
{"title":"Noise-robust assessment of SNP array based CNV calls through local noise estimation of log R ratios.","authors":"Nele Cosemans, Peter Claes, Nathalie Brison, Joris Robert Vermeesch, Hilde Peeters","doi":"10.1515/sagmb-2017-0026","DOIUrl":"https://doi.org/10.1515/sagmb-2017-0026","url":null,"abstract":"Arrays based on single nucleotide polymorphisms (SNPs) have been successful for the large scale discovery of copy number variants (CNVs). However, current CNV calling algorithms still have limitations in detecting CNVs with high specificity and sensitivity, especially in the case of small (<100 kb) CNVs. Therefore, this study presents a simple statistical analysis to evaluate CNV calls from SNP arrays in order to improve the noise-robustness of existing CNV calling algorithms. The proposed approach estimates local noise of log R ratios and returns the probability that a certain observation is different from this log R ratio noise level. This probability can be triggered at different thresholds to tailor specificity and/or sensitivity in a flexible way. Moreover, a comparison based on qPCR experiments showed that the proposed noise-robust CNV calls outperformed original ones for multiple threshold values.","PeriodicalId":49477,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"17 2","pages":""},"PeriodicalIF":0.9,"publicationDate":"2018-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/sagmb-2017-0026","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"36054869","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On \"A mutual information estimator with exponentially decaying bias\" by Zhang and Zheng.","authors":"Jialin Zhang, Chen Chen","doi":"10.1515/sagmb-2018-0005","DOIUrl":"https://doi.org/10.1515/sagmb-2018-0005","url":null,"abstract":"Zhang, Z. and Zheng, L. (2015): \"A mutual information estimator with exponentially decaying bias,\" Stat. Appl. Genet. Mol. Biol., 14, 243-252, proposed a nonparametric estimator of mutual information developed from an entropic perspective, and demonstrated that it has much smaller bias than the plugin estimator yet with the same asymptotic normality under certain conditions. However, it is incorrectly suggested in their article that the asymptotic normality could be used for testing independence between two random elements on a joint alphabet. When two random elements are independent, the asymptotic distribution of the $\\sqrt{n}$-normed estimator degenerates and therefore the claimed normality does not hold. This article complements Zhang and Zheng by establishing a new chi-square test using the same entropic statistics for mutual information being zero. The three examples in Zhang and Zheng are re-worked using the new test. The results turn out to be much more sensible and further illustrate the advantage of the entropic perspective in statistical inference on alphabets. More specifically, in Example 2, when a positive mutual information is known to exist, the new test detects it but the log likelihood ratio test fails to do so.","PeriodicalId":49477,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"17 2","pages":""},"PeriodicalIF":0.9,"publicationDate":"2018-03-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/sagmb-2018-0005","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"35962346","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jean-Eudes Dazard, Hemant Ishwaran, Rajeev Mehlotra, Aaron Weinberg, Peter Zimmerman
{"title":"Ensemble survival tree models to reveal pairwise interactions of variables with time-to-events outcomes in low-dimensional setting.","authors":"Jean-Eudes Dazard, Hemant Ishwaran, Rajeev Mehlotra, Aaron Weinberg, Peter Zimmerman","doi":"10.1515/sagmb-2017-0038","DOIUrl":"https://doi.org/10.1515/sagmb-2017-0038","url":null,"abstract":"Unraveling interactions among variables such as genetic, clinical, demographic and environmental factors is essential to understand the development of common and complex diseases. To increase the power to detect such variable interactions associated with clinical time-to-event outcomes, we borrowed established concepts from random survival forest (RSF) models. We introduce a novel RSF-based pairwise interaction estimator and derive a randomization method with bootstrap confidence intervals for inferring interaction significance. Using various linear and nonlinear time-to-event survival models in simulation studies, we first show the efficiency of our approach: true pairwise interaction-effects between variables are uncovered, while they may not be accompanied by their corresponding main-effects, and may not be detected by standard semi-parametric regression modeling and test statistics used in survival analysis. Moreover, using an RSF-based cross-validation scheme for generating prediction estimators, we show that informative predictors may be inferred. We applied our approach to an HIV cohort study recording key host gene polymorphisms and their association with HIV change of tropism or AIDS progression. Altogether, this shows how linear or nonlinear pairwise statistical interactions of variables may be efficiently detected with a predictive value in observational studies with time-to-event outcomes.","PeriodicalId":49477,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"17 1","pages":""},"PeriodicalIF":0.9,"publicationDate":"2018-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/sagmb-2017-0038","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"35840212","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Additive varying-coefficient model for nonlinear gene-environment interactions.","authors":"Cen Wu, Ping-Shou Zhong, Yuehua Cui","doi":"10.1515/sagmb-2017-0008","DOIUrl":"https://doi.org/10.1515/sagmb-2017-0008","url":null,"abstract":"Gene-environment (G×E) interaction plays a pivotal role in understanding the genetic basis of complex disease. When environmental factors are measured continuously, one can assess the genetic sensitivity over different environmental conditions on a disease trait. Motivated by the increasing awareness of gene set based association analysis over single variant based approaches, we proposed an additive varying-coefficient model to jointly model variants in a genetic system. The model allows us to examine how variants in a gene set are moderated by an environment factor to affect a disease phenotype. We approached the problem from a variable selection perspective. In particular, we select variants with varying, constant and zero coefficients, which correspond to cases of G×E interaction, no G×E interaction and no genetic effect, respectively. The procedure was implemented through a two-stage iterative estimation algorithm via the smoothly clipped absolute deviation penalty function. Under certain regularity conditions, we established the consistency property in variable selection as well as effect separation of the two-stage iterative estimators, and showed the optimal convergence rates of the estimates for varying effects. In addition, we showed that the estimates of non-zero constant coefficients enjoy the oracle property. The utility of our procedure was demonstrated through simulation studies and real data analysis.","PeriodicalId":49477,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"17 2","pages":""},"PeriodicalIF":0.9,"publicationDate":"2018-02-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/sagmb-2017-0008","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"35810903","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jiehuan Sun, Jose D Herazo-Maya, Xiu Huang, Naftali Kaminski, Hongyu Zhao
{"title":"Distance-correlation based gene set analysis in longitudinal studies.","authors":"Jiehuan Sun, Jose D Herazo-Maya, Xiu Huang, Naftali Kaminski, Hongyu Zhao","doi":"10.1515/sagmb-2017-0053","DOIUrl":"https://doi.org/10.1515/sagmb-2017-0053","url":null,"abstract":"Longitudinal gene expression profiles of subjects are collected in some clinical studies to monitor disease progression and understand disease etiology. The identification of gene sets that have coordinated changes with relevant clinical outcomes over time from these data could provide significant insights into the molecular basis of disease progression and lead to better treatments. In this article, we propose a Distance-Correlation based Gene Set Analysis (dcGSA) method for longitudinal gene expression data. dcGSA is a non-parametric approach, statistically robust, and can capture both linear and nonlinear relationships between gene sets and clinical outcomes. In addition, dcGSA is able to identify related gene sets in cases where the effects of gene sets on clinical outcomes differ across subjects due to subject heterogeneity, remove the confounding effects of some unobserved time-invariant covariates, and allow the assessment of associations between gene sets and multiple related outcomes simultaneously. Through extensive simulation studies, we demonstrate that dcGSA is more powerful in detecting relevant genes than other commonly used gene set analysis methods. When dcGSA is applied to a real dataset on systemic lupus erythematosus, we are able to identify more disease-related gene sets than other methods.","PeriodicalId":49477,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"17 1","pages":""},"PeriodicalIF":0.9,"publicationDate":"2018-02-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/sagmb-2017-0053","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"35791378","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Tests for comparison of multiple endpoints with application to omics data.","authors":"Marco Marozzi","doi":"10.1515/sagmb-2017-0033","DOIUrl":"https://doi.org/10.1515/sagmb-2017-0033","url":null,"abstract":"In biomedical research, multiple endpoints are commonly analyzed in \"omics\" fields like genomics, proteomics and metabolomics. Traditional methods designed for low-dimensional data either perform poorly or are not applicable when analyzing high-dimensional data whose dimension is generally similar to, or even much larger than, the number of subjects. The complex biochemical interplay between hundreds (or thousands) of endpoints is reflected by complex dependence relations. The aim of the paper is to propose tests that are very suitable for analyzing omics data because they do not require the normality assumption, are powerful also for small sample sizes, in the presence of complex dependence relations among endpoints, and when the number of endpoints is much larger than the number of subjects. Unbiasedness and consistency of the tests are proved and their size and power are assessed numerically. It is shown that the proposed approach based on the nonparametric combination of dependent interpoint distance tests is very effective. Applications to genomics and metabolomics are discussed.","PeriodicalId":49477,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"17 1","pages":""},"PeriodicalIF":0.9,"publicationDate":"2018-01-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/sagmb-2017-0033","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"35776896","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Non-parametric estimation of population size changes from the site frequency spectrum","authors":"B. L. Waltoft, A. Hobolth","doi":"10.1101/125351","DOIUrl":"https://doi.org/10.1101/125351","url":null,"abstract":"Changes in population size are useful for understanding the evolutionary history of a species. Genetic variation within a species can be summarized by the site frequency spectrum (SFS). For a sample of size n, the SFS is a vector of length n − 1 where entry i is the number of sites where the mutant base appears i times and the ancestral base appears n − i times. We present a new method, CubSFS, for estimating the changes in population size of a panmictic population from an observed SFS. First, we provide a straightforward proof for the expression of the expected site frequency spectrum depending only on the population size. Our derivation is based on an eigenvalue decomposition of the instantaneous coalescent rate matrix. Second, we solve the inverse problem of determining the changes in population size from an observed SFS. Our solution is based on a cubic spline for the population size. The cubic spline is determined by minimizing the weighted average of two terms, namely (i) the goodness of fit to the observed SFS, and (ii) a penalty term based on the smoothness of the changes. The weight is determined by cross-validation. The new method is validated on simulated demographic histories and applied to unfolded and folded SFS from 26 different human populations from the 1000 Genomes Project.","PeriodicalId":49477,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"17 1","pages":""},"PeriodicalIF":0.9,"publicationDate":"2017-04-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47621345","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Polyunphased: an extension to polytomous outcomes of the Unphased package for family-based genetic association analysis","authors":"A. Bureau, J. Croteau","doi":"10.1515/sagmb-2016-0035","DOIUrl":"https://doi.org/10.1515/sagmb-2016-0035","url":null,"abstract":"Polytomous phenotypes arise when a disease has multiple subtypes or when two dichotomous phenotypes are analyzed simultaneously. Few software programs offer the option to analyze such phenotypes in family studies, and none implements conditional polytomous logistic regression for within-family analysis robust to population stratification. We introduce Polyunphased, an extension to polytomous phenotypes of the Unphased package, a flexible software tool for genetic association analysis in nuclear families. Like Unphased, Polyunphased is written in C++ and runs from the command line or from a Java graphical user interface. Most Unphased options remain available in Polyunphased, including those handling missing parental genotypes while preserving robustness to population stratification, and the modelling options. Simulation studies confirmed the expected statistical behaviour of the maximum likelihood estimates of the association parameters of the conditional logistic regression model when the corresponding association parameters in the parental term of the likelihood function are set to 0, but revealed convergence problems when estimating these parental association parameters separately. The former approach is thus recommended with polytomous phenotypes.","PeriodicalId":49477,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"16 1","pages":"75 - 81"},"PeriodicalIF":0.9,"publicationDate":"2017-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/sagmb-2016-0035","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42874518","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}