{"title":"Robustness of the linear mixed effects model to error distribution assumptions and the consequences for genome-wide association studies.","authors":"Nicole M Warrington, Kate Tilling, Laura D Howe, Lavinia Paternoster, Craig E Pennell, Yan Yan Wu, Laurent Briollais","doi":"10.1515/sagmb-2013-0066","DOIUrl":"https://doi.org/10.1515/sagmb-2013-0066","url":null,"abstract":"<p><p>Genome-wide association studies have been successful in uncovering novel genetic variants that are associated with disease status or cross-sectional phenotypic traits. Researchers are beginning to investigate how genes play a role in the development of a trait over time. Linear mixed effects models (LMM) are commonly used to model longitudinal data; however, it is unclear whether the failure to meet the models' distributional assumptions will affect the conclusions when conducting a genome-wide association study. In an extensive simulation study, we compare coverage probabilities, bias, type 1 error rates and statistical power when the error of the LMM is either heteroscedastic or has a non-Gaussian distribution. We conclude that the model is robust to misspecification if the same function of age is included in the fixed and random effects. However, type 1 error of the genetic effect over time is inflated, regardless of the model misspecification, if the polynomial function for age in the fixed and random effects differs. In situations where the model will not converge with a high-order polynomial function in the random effects, a reduced function can be used but a robust standard error needs to be calculated to avoid inflation of the type 1 error. 
As an illustration, an LMM was applied to longitudinal body mass index (BMI) data over childhood in the ALSPAC cohort; the results emphasised the need for the robust standard error to ensure correct inference of associations of longitudinal BMI with chromosome 16 single nucleotide polymorphisms.</p>","PeriodicalId":48980,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"13 5","pages":"567-87"},"PeriodicalIF":0.9,"publicationDate":"2014-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/sagmb-2013-0066","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"32611032","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Bayesian modelling of compositional heterogeneity in molecular phylogenetics.","authors":"Sarah E Heaps, Tom M W Nye, Richard J Boys, Tom A Williams, T Martin Embley","doi":"10.1515/sagmb-2013-0077","DOIUrl":"https://doi.org/10.1515/sagmb-2013-0077","url":null,"abstract":"<p><p>In molecular phylogenetics, standard models of sequence evolution generally assume that sequence composition remains constant over evolutionary time. However, this assumption is violated in many datasets which show substantial heterogeneity in sequence composition across taxa. We propose a model which allows compositional heterogeneity across branches, and formulate the model in a Bayesian framework. Specifically, the root and each branch of the tree is associated with its own composition vector whilst a global matrix of exchangeability parameters applies everywhere on the tree. We encourage borrowing of strength between branches by developing two possible priors for the composition vectors: one in which information can be exchanged equally amongst all branches of the tree and another in which more information is exchanged between neighbouring branches than between distant branches. We also propose a Markov chain Monte Carlo (MCMC) algorithm for posterior inference which uses data augmentation of substitutional histories to yield a simple complete data likelihood function that factorises over branches and allows Gibbs updates for most parameters. Standard phylogenetic models are not informative about the root position. Therefore a significant advantage of the proposed model is that it allows inference about rooted trees. The position of the root is fundamental to the biological interpretation of trees, both for polarising trait evolution and for establishing the order of divergence among lineages. 
Furthermore, unlike some other related models from the literature, inference in the model we propose can be carried out through a simple MCMC scheme which does not require problematic dimension-changing moves. We investigate the performance of the model and priors in analyses of two alignments for which there is strong biological opinion about the tree topology and root position.</p>","PeriodicalId":48980,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"13 5","pages":"589-609"},"PeriodicalIF":0.9,"publicationDate":"2014-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/sagmb-2013-0077","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"32611034","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Gene set analysis for GWAS: assessing the use of modified Kolmogorov-Smirnov statistics.","authors":"Birgit Debrabant, Mette Soerensen","doi":"10.1515/sagmb-2013-0015","DOIUrl":"https://doi.org/10.1515/sagmb-2013-0015","url":null,"abstract":"<p><p>We discuss the use of modified Kolmogorov-Smirnov (KS) statistics in the context of gene set analysis and review corresponding null and alternative hypotheses. In particular, we show that, when enhancing the impact of highly significant genes in the calculation of the test statistic, the corresponding test can be considered to infer the classical self-contained null hypothesis. We use simulations to estimate the power for different kinds of alternatives, and to assess the impact of the weight parameter of the modified KS statistic on the power. Finally, we show the analogy between the weight parameter and the genesis and distribution of the gene-level statistics, and illustrate the effects of differential weighting in a real-life example.</p>","PeriodicalId":48980,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"13 5","pages":"553-66"},"PeriodicalIF":0.9,"publicationDate":"2014-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/sagmb-2013-0015","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"32611031","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Bayesian identification of protein differential expression in multi-group isobaric labelled mass spectrometry data.","authors":"Howsun Jow, Richard J Boys, Darren J Wilkinson","doi":"10.1515/sagmb-2012-0066","DOIUrl":"https://doi.org/10.1515/sagmb-2012-0066","url":null,"abstract":"<p><p>In this paper we develop a Bayesian statistical inference approach to the unified analysis of isobaric labelled MS/MS proteomic data across multiple experiments. An explicit probabilistic model of the log-intensity of the isobaric labels' reporter ions across multiple pre-defined groups and experiments is developed. This is then used to develop a full Bayesian statistical methodology for the identification of differentially expressed proteins, with respect to a control group, across multiple groups and experiments. This methodology is implemented and then evaluated on simulated data and on two model experimental datasets (for which the differentially expressed proteins are known) that use a TMT labelling protocol.</p>","PeriodicalId":48980,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"13 5","pages":"531-51"},"PeriodicalIF":0.9,"publicationDate":"2014-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"32611033","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Quantifying the multi-scale performance of network inference algorithms.","authors":"Chris J Oates, Richard Amos, Simon E F Spencer","doi":"10.1515/sagmb-2014-0012","DOIUrl":"https://doi.org/10.1515/sagmb-2014-0012","url":null,"abstract":"<p><p>Graphical models are widely used to study complex multivariate biological systems. Network inference algorithms aim to reverse-engineer such models from noisy experimental data. It is common to assess such algorithms using techniques from classifier analysis. These metrics, based on ability to correctly infer individual edges, possess a number of appealing features including invariance to rank-preserving transformation. However, regulation in biological systems occurs on multiple scales and existing metrics do not take into account the correctness of higher-order network structure. In this paper novel performance scores are presented that share the appealing properties of existing scores, whilst capturing ability to uncover regulation on multiple scales. Theoretical results confirm that performance of a network inference algorithm depends crucially on the scale at which inferences are to be made; in particular strong local performance does not guarantee accurate reconstruction of higher-order topology. Applying these scores to a large corpus of data from the DREAM5 challenge, we undertake a data-driven assessment of estimator performance. 
We find that the \"wisdom of crowds\" network, that demonstrated superior local performance in the DREAM5 challenge, is also among the best performing methodologies for inference of regulation on multiple length scales.</p>","PeriodicalId":48980,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"13 5","pages":"611-31"},"PeriodicalIF":0.9,"publicationDate":"2014-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/sagmb-2014-0012","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"32610678","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Combining dependent F-tests for robust association of quantitative traits under genetic model uncertainty.","authors":"Long Qu","doi":"10.1515/sagmb-2013-0001","DOIUrl":"https://doi.org/10.1515/sagmb-2013-0001","url":null,"abstract":"<p><p>In association mapping of quantitative traits, the F-test based on an assumed genetic model is a basic statistical tool for testing association of each candidate locus with the trait of interest. However, the true underlying genetic model is often unknown, and using an incorrect model may cause serious loss of power. For case-control studies, it is known that the combination of several tests that are optimal for different models is robust to model misspecification. In this paper, we extend the test combination approach to quantitative trait association. We first derive the exact correlations among transformed test statistics and discuss interesting special cases. We then propose and evaluate a multivariate normality based approximation to the joint distribution of test statistics, such that the marginal distributions and pairwise correlations among test statistics are accounted for. Through simulations, we show that the sizes of the resulting approximate combined tests are accurate for practical purposes under a variety of situations. We find that the combination of the tests from the additive model and the genotypic model performs well, because it demonstrates both robustness to incorrect models and satisfactory power. 
A mouse lipoprotein data set is used to demonstrate the method.</p>","PeriodicalId":48980,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"13 2","pages":"123-39"},"PeriodicalIF":0.9,"publicationDate":"2014-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/sagmb-2013-0001","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"40289963","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Modeling, simulation and analysis of methylation profiles from reduced representation bisulfite sequencing experiments.","authors":"Michelle R Lacey, Carl Baribault, Melanie Ehrlich","doi":"10.1515/sagmb-2013-0027","DOIUrl":"https://doi.org/10.1515/sagmb-2013-0027","url":null,"abstract":"<p><p>The ENCODE project has funded the generation of a diverse collection of methylation profiles using reduced representation bisulfite sequencing (RRBS) technology, enabling the analysis of epigenetic variation on a genomic scale at single-site resolution. A standard application of RRBS experiments is in the location of differentially methylated regions (DMRs) between two sets of samples. Despite numerous publications reporting DMRs identified from RRBS datasets, there have been no formal analyses of the effects of experimental and biological factors on the performance of existing or newly developed analytical methods. These factors include variable read coverage, differing group sample sizes across genomic regions, uneven spacing between CpG dinucleotide sites, and correlation in methylation levels among sites in close proximity. To better understand the interplay among technical and biological variables in the analysis of RRBS methylation profiles, we have developed an algorithm for the generation of experimentally realistic RRBS datasets. Applying insights derived from our simulation studies, we present a novel procedure that can identify DMRs spanning as few as three CpG sites with both high sensitivity and specificity. Using RRBS data from muscle vs. 
non-muscle cell cultures as an example, we demonstrate that our method reveals many more DMRs that are likely to be of biological significance than previous methods.</p>","PeriodicalId":48980,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"12 6","pages":"723-42"},"PeriodicalIF":0.9,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/sagmb-2013-0027","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"40269820","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Kernel approximate Bayesian computation in population genetic inferences.","authors":"Shigeki Nakagome, Kenji Fukumizu, Shuhei Mano","doi":"10.1515/sagmb-2012-0050","DOIUrl":"https://doi.org/10.1515/sagmb-2012-0050","url":null,"abstract":"<p><p>Approximate Bayesian computation (ABC) is a likelihood-free approach for Bayesian inference based on a rejection algorithm that applies a tolerance of dissimilarity between summary statistics from observed and simulated data. Although several improvements to the algorithm have been proposed, none of these improvements avoid the following two sources of approximation: 1) lack of sufficient statistics: sampling is not from the true posterior density given data but from an approximate posterior density given summary statistics; and 2) non-zero tolerance: sampling from the posterior density given summary statistics is achieved only in the limit of zero tolerance. The first source of approximation can be improved by adding a summary statistic, but an increase in the number of summary statistics could introduce additional variance caused by the low acceptance rate. Consequently, many researchers have attempted to develop techniques to choose informative summary statistics. The present study evaluated the utility of a kernel-based ABC method [Fukumizu, K., L. Song and A. Gretton (2010): \"Kernel Bayes' rule: Bayesian inference with positive definite kernels,\" arXiv:1009.5736; Fukumizu, K., L. Song and A. Gretton (2011): \"Kernel Bayes' rule,\" in J. Shawe-Taylor, R. S. Zemel, P. Bartlett, F. Pereira and K. Q. Weinberger (Eds.), Advances in Neural Information Processing Systems 24, pp. 1549-1557] for complex problems that demand many summary statistics. Specifically, kernel ABC was applied to population genetic inference. 
We demonstrate that, in contrast to conventional ABCs, kernel ABC can incorporate a large number of summary statistics while maintaining high performance of the inference.</p>","PeriodicalId":48980,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"12 6","pages":"667-78"},"PeriodicalIF":0.9,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/sagmb-2012-0050","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"40256700","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An order estimation based approach to identify response genes for microarray time course data.","authors":"Zhiheng K Lu, O Brian Allen, Anthony F Desmond","doi":"10.1515/1544-6115.1818","DOIUrl":"https://doi.org/10.1515/1544-6115.1818","url":null,"abstract":"<p><p>Gene expression profiles from microarray time course experiments provide a unique opportunity to examine genome-wide signal processing and gene responses. A fundamental issue in microarray experiments is that the treatment condition can only be controlled at the cell level rather than at the gene level. The treatment condition does not affect all genes equally. Some genes depend on other genes to detect external changes. The dependency between genes is not fully deterministic and may vary with treatment condition. Thus the expression of each gene is potentially affected by two confounding effects: the treatment effect and the gene context effect arising from the regulatory interactions among genes. This gene context effect is hard to isolate. Neither can it be simply ignored. Instead, this gene context information which may be different under different treatment conditions is of primary biological interest. We introduce an approach which deals with the confounding effects and takes into account the uncontrollable gene context effect. Our method is based on the estimation of the number of hidden states, which, in our development, corresponds to the order of a hidden Markov model (HMM). For each gene, its observed expression is modeled by a gamma distribution determined by the corresponding hidden state at each time point. Those genes showing evidence for more than one hidden state can be categorized as the signalling genes, or in a wider sense, as the response genes which are coordinated by a cell system in reaction to a specific external condition. 
These response genes can be used in the comparison of different treatment conditions, to investigate the gene context effect under different treatments. Microarray time course data are also analyzed to demonstrate our method.</p>","PeriodicalId":48980,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"11 6","pages":""},"PeriodicalIF":0.9,"publicationDate":"2012-12-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/1544-6115.1818","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"31121153","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hierarchical shrinkage priors and model fitting for high-dimensional generalized linear models.","authors":"Nengjun Yi, Shuangge Ma","doi":"10.1515/1544-6115.1803","DOIUrl":"https://doi.org/10.1515/1544-6115.1803","url":null,"abstract":"<p><p>Genetic and other scientific studies routinely generate very many predictor variables, which can be naturally grouped, with predictors in the same groups being highly correlated. It is desirable to incorporate the hierarchical structure of the predictor variables into generalized linear models for simultaneous variable selection and coefficient estimation. We propose two prior distributions: hierarchical Cauchy and double-exponential distributions, on coefficients in generalized linear models. The hierarchical priors include both variable-specific and group-specific tuning parameters, thereby not only adopting different shrinkage for different coefficients and different groups but also providing a way to pool the information within groups. We fit generalized linear models with the proposed hierarchical priors by incorporating flexible expectation-maximization (EM) algorithms into the standard iteratively weighted least squares as implemented in the general statistical package R. The methods are illustrated with data from an experiment to identify genetic polymorphisms for survival of mice following infection with Listeria monocytogenes. The performance of the proposed procedures is further assessed via simulation studies. 
The methods are implemented in a freely available R package BhGLM (http://www.ssg.uab.edu/bhglm/).</p>","PeriodicalId":48980,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"11 6","pages":""},"PeriodicalIF":0.9,"publicationDate":"2012-11-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3658361/pdf/nihms-466426.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"31081969","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}