Statistical Modeling and Inference in Genetics

Daniel Wegmann, C. Leuenberger
{"title":"遗传学中的统计建模和推断","authors":"Daniel Wegmann, C. Leuenberger","doi":"10.1002/9781119487845.ch1","DOIUrl":null,"url":null,"abstract":"Given the long mathematical history and tradition in genetics, and particularly in population genetics, it is not surprising that model-based statistical inference has always been an integral part of statistical genetics, and vice versa. Since the big data revolution due to novel sequencing technologies, statistical genetics has further relied heavily on numerical methods for inference. In this chapter we give a brief overview over the foundations of statistical inference, including both the frequentist and Bayesian schools, and introduce analytical and numerical methods commonly applied in statistical genetics. A particular focus is put on recent approximate techniques that now play an important role in several fields of statistical genetics. We conclude by applying several of the algorithms introduced to hidden Markov models, which have been used very successfully to model processes along chromosomes. Throughout we strive for the impossible task of making the material accessible to readers with limited statistical background, while hoping that it will also constitute a worthy refresher for more advanced readers. Readers who already have a solid statistical background may safely skip the first introductory part and jump directly to Section 1.2.3. 1.1 Statistical Models and Inference Statistical inference offers a formal approach to characterizing a random phenomenon using observations, either by providing a description of a past phenomenon, or by giving some predictions about future phenomena of a similar nature. This is typically done by estimating a vector of parameters θ from on a vector of observations or data , using the formal framework and laws of probability. The interpretation of probabilities is a somewhat contentious issue, with multiple competing interpretations. Specifically, probabilities can be seen as the frequencies with which specific events occur in a repeatable experiment (the frequentist interpretation; Lehmann and Casella, 2006), or as reflecting the uncertainty or degree of belief about the state of a random variable (the Bayesian interpretation; Robert, 2007). In frequentist statistics, only  is thus considered as a random variable, while in Bayesian statistics both and θ are considered random variables. The goal of this chapter is not, however, to enter any debate about the validity of the two competing schools of thought. Instead, our aim is to introduce the most commonly used inference methods of both schools. Indeed, most researchers in statistical genetics, including ourselves, choose their approaches pragmatically based on computational considerations rather than Handbook of Statistical Genomics, Fourth Edition, Volume 1. Edited by David J. Balding, Ida Moltke and John Marioni. © 2019 John Wiley & Sons Ltd. Published 2019 by John Wiley & Sons Ltd. CO PY RI GH TE D M AT ER IA L JWST943-c01 JWST943-Balding May 28, 2019 16:48 Printer Name: Trim: 254mm × 178mm 2 D. Wegmann and C. Leuenberger strong philosophical grounds. Yet, the two schools differ slightly in their language. To keep the introduction succinct and consistent, we introduce the basic concepts of statistical modeling first from the Bayesian point of view. The main differences with respect to the frequentist view are then discussed below. 
1.1.1 Statistical Models 1.1.1.1 Independence Assumptions The first step in statistical inference is to specify a statistical model, which consists of identifying all relevant variables ,θ and formulating the joint probability distribution P(,θ) of their interaction, usually under some simplifying assumptions.1 It is hard to overestimate the importance of this step: ignoring a variable makes the strong assumption that this variable is independent or conditionally independent of all variables considered. By focusing on a summary statistic or subset T() of the data , for instance, it is implied that T() contains all information about θ present in . Similarly, all variables not included in θ are assumed to be independent of conditioned on θ. A third type of assumption that is often made is to consider specific variables to be conditionally independent of each other. That is particularly relevant in hierarchical models where the probability distribution of one parameter is dependent on the values of other hierarchical parameters. Example 1.1 (Allele frequencies). We strive to illustrate all concepts in this chapter through a limited number of compelling scenarios that we revisit frequently. One of these is the problem of inferring the frequency f of the derived allele at a bi-allelic locus from DNA sequence data. While f may denote the frequency of either of the two alleles, we will assume here, without loss of generality, that the two alleles can be polarized into the ancestral and derived allele, where the latter arose from the former through a mutation. Consider now DNA sequence data d = {d1,... , dn} obtained for n diploid individuals with sequencing errors at rate ε. Obviously, f could easily be calculated if all genotypes were known. However, using a statistical model that properly accounts for genotyping uncertainty, a hierarchical parameter such as f can be estimated from much less data (and hence sequencing depth) than would be necessary to accurately infer all n genotypes. An appropriate statistical model with parameters θ = { f , ε} and data  = d might look as follows: P(d, f , ε) = P(d| f , ε)P( f , ε) = [ n ∏ i=1 ∑ gi P(di|gi, ε)P( gi| f )]P( f , ε). (1.1) Here, the sum runs over all possible values of the unknown genotypes gi. The model introduced in Example 1.1 makes the strong assumptions that the only relevant variables are the sequencing data d, the unknown genotypes g = {g1,... , gn}, the sequencing error rate ε and the allele frequency f . In addition, the model makes the conditional independence assumptions P(di|gi, ε, f , d−i) = P(di|gi, ε) that the sequencing data di obtained for individual i is independent of f and the sequencing data of all other individuals d−i when conditioning on a particular genotype gi. Variables may also become conditionally dependent, as do, for instance, f and ε once specific data is considered in the above model. Undeniably, the data d constrains ε and f : observing around 5% of derived alleles, for instance, is only compatible with f = 0 if ε ≈ 0.05, but not 1 To keep the notation simple, we will denote by P(⋅) the probability of both discrete and continuous variables. Also, we will typically assume the continuous case when describing general concepts and thus use integrals instead of sums. 
JWST943-c01 JWST943-Balding May 28, 2019 16:48 Printer Name: Trim: 254mm × 178mm 1 Statistical Modeling and Inference in Genetics 3 (a) (b) Figure 1.1 (a) Directed acyclic graph (DAG) representing the independence assumptions of Example 1.1 as given in equation (1.1). Observed data is shown as squares; unknown variables as circles. (b) The same DAG in plate notation, where a plate replicates the inside quantities as many times as specified in the plate (here n times). with a much lower ε. This also highlights that statistical dependence in a model never implies causality in nature. Indeed, the allele frequency does not causally affect the error rate of the sequencing machine, yet in the model the two variables f and ε are dependent as they are connected through the data d. Importantly, therefore, a statistical model is not a statement about causality, but only about (conditional) independence assumptions. An excellent discussion on this is given in Barber (2012, Ch. 2). It is often helpful to illustrate the specific independence assumptions of a statistical model graphically using a so-called directed acyclic graph (DAG; Barber, 2012; Koller and Friedman, 2009). In a DAG, each variable xi is a node, and any variable xj from which a directed edge points to xi is considered a parental variable of xi. A DAG for the model of Example 1.1 is given in Figure 1.1, from which the independence assumptions of the model are easily read: (1) Each variable in the DAG is assumed to be independent of any variable not included in the DAG. (2) Each variable is assumed not to be independent of its parental variables. In our case, for instance, we assume that the data di of individual i is not independent of the genotype gi, nor the sequencing error rate ε. (3) Each pair of variables a, b connected as a → x → b or a ← x → b is independent when conditioning on x. In our example, all di are independent of f and all other dj, j ≠ i, when conditioning on gi and ε. (4) If variables a, b are connected as a → x ← b, x is called a collider; conditioning on it, a and b become dependent. In our example, ε and gi are thus not independent as soon as specific data di is considered. The same holds for ε and f , unless we additionally condition on all gi. Let us recall at this point that a frequentist would discuss the above concepts with a slightly different vocabulary. 1.1.1.2 Probability Distributions Once independence assumptions are set, explicit assumptions on the probability distributions have to be made. We note that this is not a requirement for so-called nonparametric statistical approaches. However, we will not consider these here because most nonparametric approaches are either restricted to hypothesis testing or only justified when sample sizes are very large, while many problems in genetics have to be solved with limited data. Instead, we will focus on parametric modeling and assume that the observations  were generated from parameterized probability distributions P(|θ) with unknown parameters θ, but known function P, which thus need to be specified. Example 1.2 (Allele frequency). For the model of Example 1.1 given in equation (1.1), two probability functions have to be specified: P(di|gi, ε) and P( gi| f ). 
For the latter, we might be willing to assume that genotypes are in Hardy–Weinberg equilibrium (Hardy, 1908; Weinberg, 1908), such that P( g| f ) = 2g ) f","PeriodicalId":216924,"journal":{"name":"Handbook of Statistical Genomics","volume":"37 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-08-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Statistical Modeling and Inference in Genetics\",\"authors\":\"Daniel Wegmann, C. Leuenberger\",\"doi\":\"10.1002/9781119487845.ch1\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Given the long mathematical history and tradition in genetics, and particularly in population genetics, it is not surprising that model-based statistical inference has always been an integral part of statistical genetics, and vice versa. Since the big data revolution due to novel sequencing technologies, statistical genetics has further relied heavily on numerical methods for inference. In this chapter we give a brief overview over the foundations of statistical inference, including both the frequentist and Bayesian schools, and introduce analytical and numerical methods commonly applied in statistical genetics. A particular focus is put on recent approximate techniques that now play an important role in several fields of statistical genetics. We conclude by applying several of the algorithms introduced to hidden Markov models, which have been used very successfully to model processes along chromosomes. Throughout we strive for the impossible task of making the material accessible to readers with limited statistical background, while hoping that it will also constitute a worthy refresher for more advanced readers. Readers who already have a solid statistical background may safely skip the first introductory part and jump directly to Section 1.2.3. 1.1 Statistical Models and Inference Statistical inference offers a formal approach to characterizing a random phenomenon using observations, either by providing a description of a past phenomenon, or by giving some predictions about future phenomena of a similar nature. This is typically done by estimating a vector of parameters θ from on a vector of observations or data , using the formal framework and laws of probability. The interpretation of probabilities is a somewhat contentious issue, with multiple competing interpretations. Specifically, probabilities can be seen as the frequencies with which specific events occur in a repeatable experiment (the frequentist interpretation; Lehmann and Casella, 2006), or as reflecting the uncertainty or degree of belief about the state of a random variable (the Bayesian interpretation; Robert, 2007). In frequentist statistics, only  is thus considered as a random variable, while in Bayesian statistics both and θ are considered random variables. The goal of this chapter is not, however, to enter any debate about the validity of the two competing schools of thought. Instead, our aim is to introduce the most commonly used inference methods of both schools. Indeed, most researchers in statistical genetics, including ourselves, choose their approaches pragmatically based on computational considerations rather than Handbook of Statistical Genomics, Fourth Edition, Volume 1. Edited by David J. Balding, Ida Moltke and John Marioni. © 2019 John Wiley & Sons Ltd. Published 2019 by John Wiley & Sons Ltd. 
CO PY RI GH TE D M AT ER IA L JWST943-c01 JWST943-Balding May 28, 2019 16:48 Printer Name: Trim: 254mm × 178mm 2 D. Wegmann and C. Leuenberger strong philosophical grounds. Yet, the two schools differ slightly in their language. To keep the introduction succinct and consistent, we introduce the basic concepts of statistical modeling first from the Bayesian point of view. The main differences with respect to the frequentist view are then discussed below. 1.1.1 Statistical Models 1.1.1.1 Independence Assumptions The first step in statistical inference is to specify a statistical model, which consists of identifying all relevant variables ,θ and formulating the joint probability distribution P(,θ) of their interaction, usually under some simplifying assumptions.1 It is hard to overestimate the importance of this step: ignoring a variable makes the strong assumption that this variable is independent or conditionally independent of all variables considered. By focusing on a summary statistic or subset T() of the data , for instance, it is implied that T() contains all information about θ present in . Similarly, all variables not included in θ are assumed to be independent of conditioned on θ. A third type of assumption that is often made is to consider specific variables to be conditionally independent of each other. That is particularly relevant in hierarchical models where the probability distribution of one parameter is dependent on the values of other hierarchical parameters. Example 1.1 (Allele frequencies). We strive to illustrate all concepts in this chapter through a limited number of compelling scenarios that we revisit frequently. One of these is the problem of inferring the frequency f of the derived allele at a bi-allelic locus from DNA sequence data. While f may denote the frequency of either of the two alleles, we will assume here, without loss of generality, that the two alleles can be polarized into the ancestral and derived allele, where the latter arose from the former through a mutation. Consider now DNA sequence data d = {d1,... , dn} obtained for n diploid individuals with sequencing errors at rate ε. Obviously, f could easily be calculated if all genotypes were known. However, using a statistical model that properly accounts for genotyping uncertainty, a hierarchical parameter such as f can be estimated from much less data (and hence sequencing depth) than would be necessary to accurately infer all n genotypes. An appropriate statistical model with parameters θ = { f , ε} and data  = d might look as follows: P(d, f , ε) = P(d| f , ε)P( f , ε) = [ n ∏ i=1 ∑ gi P(di|gi, ε)P( gi| f )]P( f , ε). (1.1) Here, the sum runs over all possible values of the unknown genotypes gi. The model introduced in Example 1.1 makes the strong assumptions that the only relevant variables are the sequencing data d, the unknown genotypes g = {g1,... , gn}, the sequencing error rate ε and the allele frequency f . In addition, the model makes the conditional independence assumptions P(di|gi, ε, f , d−i) = P(di|gi, ε) that the sequencing data di obtained for individual i is independent of f and the sequencing data of all other individuals d−i when conditioning on a particular genotype gi. Variables may also become conditionally dependent, as do, for instance, f and ε once specific data is considered in the above model. 
Undeniably, the data d constrains ε and f : observing around 5% of derived alleles, for instance, is only compatible with f = 0 if ε ≈ 0.05, but not 1 To keep the notation simple, we will denote by P(⋅) the probability of both discrete and continuous variables. Also, we will typically assume the continuous case when describing general concepts and thus use integrals instead of sums. JWST943-c01 JWST943-Balding May 28, 2019 16:48 Printer Name: Trim: 254mm × 178mm 1 Statistical Modeling and Inference in Genetics 3 (a) (b) Figure 1.1 (a) Directed acyclic graph (DAG) representing the independence assumptions of Example 1.1 as given in equation (1.1). Observed data is shown as squares; unknown variables as circles. (b) The same DAG in plate notation, where a plate replicates the inside quantities as many times as specified in the plate (here n times). with a much lower ε. This also highlights that statistical dependence in a model never implies causality in nature. Indeed, the allele frequency does not causally affect the error rate of the sequencing machine, yet in the model the two variables f and ε are dependent as they are connected through the data d. Importantly, therefore, a statistical model is not a statement about causality, but only about (conditional) independence assumptions. An excellent discussion on this is given in Barber (2012, Ch. 2). It is often helpful to illustrate the specific independence assumptions of a statistical model graphically using a so-called directed acyclic graph (DAG; Barber, 2012; Koller and Friedman, 2009). In a DAG, each variable xi is a node, and any variable xj from which a directed edge points to xi is considered a parental variable of xi. A DAG for the model of Example 1.1 is given in Figure 1.1, from which the independence assumptions of the model are easily read: (1) Each variable in the DAG is assumed to be independent of any variable not included in the DAG. (2) Each variable is assumed not to be independent of its parental variables. In our case, for instance, we assume that the data di of individual i is not independent of the genotype gi, nor the sequencing error rate ε. (3) Each pair of variables a, b connected as a → x → b or a ← x → b is independent when conditioning on x. In our example, all di are independent of f and all other dj, j ≠ i, when conditioning on gi and ε. (4) If variables a, b are connected as a → x ← b, x is called a collider; conditioning on it, a and b become dependent. In our example, ε and gi are thus not independent as soon as specific data di is considered. The same holds for ε and f , unless we additionally condition on all gi. Let us recall at this point that a frequentist would discuss the above concepts with a slightly different vocabulary. 1.1.1.2 Probability Distributions Once independence assumptions are set, explicit assumptions on the probability distributions have to be made. We note that this is not a requirement for so-called nonparametric statistical approaches. However, we will not consider these here because most nonparametric approaches are either restricted to hypothesis testing or only justified when sample sizes are very large, while many problems in genetics have to be solved with limited data. Instead, we will focus on parametric modeling and assume that the observations  were generated from parameterized probability distributions P(|θ) with unknown parameters θ, but known function P, which thus need to be specified. Example 1.2 (Allele frequency). 
For the model of Example 1.1 given in equation (1.1), two probability functions have to be specified: P(di|gi, ε) and P( gi| f ). For the latter, we might be willing to assume that genotypes are in Hardy–Weinberg equilibrium (Hardy, 1908; Weinberg, 1908), such that P( g| f ) = 2g ) f\",\"PeriodicalId\":216924,\"journal\":{\"name\":\"Handbook of Statistical Genomics\",\"volume\":\"37 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2019-08-09\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Handbook of Statistical Genomics\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1002/9781119487845.ch1\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Handbook of Statistical Genomics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1002/9781119487845.ch1","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Given the long mathematical history and tradition in genetics, and particularly in population genetics, it is not surprising that model-based statistical inference has always been an integral part of statistical genetics, and vice versa. Since the big data revolution brought about by novel sequencing technologies, statistical genetics has further relied heavily on numerical methods for inference. In this chapter we give a brief overview of the foundations of statistical inference, including both the frequentist and Bayesian schools, and introduce analytical and numerical methods commonly applied in statistical genetics. A particular focus is put on recent approximate techniques that now play an important role in several fields of statistical genetics. We conclude by applying several of the algorithms introduced to hidden Markov models, which have been used very successfully to model processes along chromosomes. Throughout, we strive for the impossible task of making the material accessible to readers with limited statistical background, while hoping that it will also constitute a worthy refresher for more advanced readers. Readers who already have a solid statistical background may safely skip the first introductory part and jump directly to Section 1.2.3.

1.1 Statistical Models and Inference

Statistical inference offers a formal approach to characterizing a random phenomenon using observations, either by providing a description of a past phenomenon, or by giving some predictions about future phenomena of a similar nature. This is typically done by estimating a vector of parameters θ from a vector of observations or data D, using the formal framework and laws of probability. The interpretation of probabilities is a somewhat contentious issue, with multiple competing interpretations. Specifically, probabilities can be seen as the frequencies with which specific events occur in a repeatable experiment (the frequentist interpretation; Lehmann and Casella, 2006), or as reflecting the uncertainty or degree of belief about the state of a random variable (the Bayesian interpretation; Robert, 2007). In frequentist statistics, only D is thus considered a random variable, while in Bayesian statistics both D and θ are considered random variables. The goal of this chapter is not, however, to enter any debate about the validity of the two competing schools of thought. Instead, our aim is to introduce the most commonly used inference methods of both schools. Indeed, most researchers in statistical genetics, including ourselves, choose their approaches pragmatically based on computational considerations rather than strong philosophical grounds. Yet, the two schools differ slightly in their language. To keep the introduction succinct and consistent, we introduce the basic concepts of statistical modeling first from the Bayesian point of view. The main differences with respect to the frequentist view are then discussed below.
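To preview this practical difference with the chapter's running allele-frequency example (introduced formally in Example 1.1 below), consider the idealized case where all genotypes are known, so that derived-allele counts are binomial. The following sketch is our own illustration, not part of the chapter: the counts are hypothetical, and the uniform Beta(1, 1) prior is an assumption we make for the Bayesian side.

```python
from scipy.stats import beta

# Hypothetical counts: among 2n = 40 chromosomes of n = 20 genotyped
# individuals, k = 12 carry the derived allele.
n_chrom, k = 40, 12

# Frequentist view: f is a fixed unknown quantity; only the data are
# random. The maximum likelihood estimate is the observed fraction.
f_mle = k / n_chrom  # 0.3

# Bayesian view: f is itself a random variable. With a uniform
# Beta(1, 1) prior, the binomial likelihood yields the conjugate
# posterior Beta(1 + k, 1 + n_chrom - k) for f.
posterior = beta(1 + k, 1 + n_chrom - k)

print(f"MLE: {f_mle:.3f}")
print(f"Posterior mean: {posterior.mean():.3f}")
print(f"95% credible interval: {posterior.interval(0.95)}")
```

The two point summaries nearly coincide here; what differs is the interpretation: the frequentist treats only the data as random, while the Bayesian also treats f as random.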
1.1.1 Statistical Models

1.1.1.1 Independence Assumptions

The first step in statistical inference is to specify a statistical model, which consists of identifying all relevant variables D, θ and formulating the joint probability distribution P(D, θ) of their interaction, usually under some simplifying assumptions.¹ It is hard to overestimate the importance of this step: ignoring a variable makes the strong assumption that this variable is independent, or conditionally independent, of all variables considered. By focusing on a summary statistic or subset T(D) of the data D, for instance, it is implied that T(D) contains all information about θ present in D. Similarly, all variables not included in θ are assumed to be independent of D when conditioning on θ. A third type of assumption that is often made is to consider specific variables to be conditionally independent of each other. That is particularly relevant in hierarchical models, where the probability distribution of one parameter depends on the values of other hierarchical parameters.

¹ To keep the notation simple, we will denote by P(⋅) the probability of both discrete and continuous variables. Also, we will typically assume the continuous case when describing general concepts and thus use integrals instead of sums.

Example 1.1 (Allele frequencies). We strive to illustrate all concepts in this chapter through a limited number of compelling scenarios that we revisit frequently. One of these is the problem of inferring the frequency f of the derived allele at a bi-allelic locus from DNA sequence data. While f may denote the frequency of either of the two alleles, we will assume here, without loss of generality, that the two alleles can be polarized into the ancestral and the derived allele, where the latter arose from the former through a mutation. Consider now DNA sequence data d = {d1, ..., dn} obtained for n diploid individuals with sequencing errors occurring at rate ε. Obviously, f could easily be calculated if all genotypes were known. However, using a statistical model that properly accounts for genotyping uncertainty, a hierarchical parameter such as f can be estimated from much less data (and hence lower sequencing depth) than would be necessary to accurately infer all n genotypes. An appropriate statistical model with parameters θ = {f, ε} and data D = d might look as follows:

$$P(d, f, \varepsilon) = P(d \mid f, \varepsilon)\, P(f, \varepsilon) = \left[ \prod_{i=1}^{n} \sum_{g_i} P(d_i \mid g_i, \varepsilon)\, P(g_i \mid f) \right] P(f, \varepsilon). \tag{1.1}$$

Here, the sum runs over all possible values of the unknown genotype gi.

The model introduced in Example 1.1 makes the strong assumption that the only relevant variables are the sequencing data d, the unknown genotypes g = {g1, ..., gn}, the sequencing error rate ε and the allele frequency f. In addition, the model makes the conditional independence assumptions P(di | gi, ε, f, d−i) = P(di | gi, ε): the sequencing data di obtained for individual i are independent of f and of the sequencing data d−i of all other individuals when conditioning on a particular genotype gi. Variables may also become conditionally dependent, as do, for instance, f and ε once specific data are considered in the above model. Undeniably, the data d constrain ε and f: observing around 5% derived alleles, for instance, is only compatible with f = 0 if ε ≈ 0.05, but not with a much lower ε.
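To make equation (1.1) concrete, here is a minimal numerical sketch of the bracketed likelihood term P(d | f, ε). It assumes, purely for illustration, a read model the text has not specified: individual i yields a known number of reads, each independently showing the derived allele with probability (gi/2)(1 − ε) + (1 − gi/2)ε; genotypes follow the Hardy–Weinberg prior anticipated in Example 1.2. The function names and the data are our own hypothetical choices.

```python
from scipy.stats import binom

def p_reads_given_genotype(k, n_reads, g, eps):
    # P(d_i | g_i, eps): k derived-allele reads out of n_reads, assuming
    # reads are drawn independently from the two chromosomes and each
    # base is miscalled with probability eps (an illustrative choice).
    p_derived = (g / 2) * (1 - eps) + (1 - g / 2) * eps
    return binom.pmf(k, n_reads, p_derived)

def p_genotype_given_f(g, f):
    # P(g_i | f) under Hardy-Weinberg equilibrium: Binomial(2, f).
    return binom.pmf(g, 2, f)

def likelihood(data, f, eps):
    # P(d | f, eps) as in equation (1.1): a product over individuals of
    # a sum over the three possible genotypes g_i in {0, 1, 2}.
    lik = 1.0
    for k, n_reads in data:
        lik *= sum(p_reads_given_genotype(k, n_reads, g, eps)
                   * p_genotype_given_f(g, f)
                   for g in (0, 1, 2))
    return lik

# Hypothetical data: (derived reads, total reads) for three individuals.
data = [(0, 10), (4, 8), (9, 12)]
print(likelihood(data, f=0.4, eps=0.01))
```

Multiplying this term by a prior P(f, ε) gives the joint distribution on the left-hand side of equation (1.1); in practice one would accumulate log-probabilities instead to avoid numerical underflow.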
[Figure 1.1 (a) A directed acyclic graph (DAG) representing the independence assumptions of Example 1.1 as given in equation (1.1); observed data are shown as squares, unknown variables as circles. (b) The same DAG in plate notation, where a plate replicates the quantities inside it as many times as specified (here n times).]

This dependence between f and ε also highlights that statistical dependence in a model never implies causality in nature. Indeed, the allele frequency does not causally affect the error rate of the sequencing machine, yet in the model the two variables f and ε are dependent because they are connected through the data d. Importantly, therefore, a statistical model is not a statement about causality, but only about (conditional) independence assumptions. An excellent discussion of this is given in Barber (2012, Ch. 2).

It is often helpful to illustrate the specific independence assumptions of a statistical model graphically using a so-called directed acyclic graph (DAG; Barber, 2012; Koller and Friedman, 2009). In a DAG, each variable xi is a node, and any variable xj from which a directed edge points to xi is considered a parental variable of xi. A DAG for the model of Example 1.1 is given in Figure 1.1, from which the independence assumptions of the model are easily read:

(1) Each variable in the DAG is assumed to be independent of any variable not included in the DAG.
(2) Each variable is assumed not to be independent of its parental variables. In our case, for instance, we assume that the data di of individual i are not independent of the genotype gi, nor of the sequencing error rate ε.
(3) Each pair of variables a, b connected as a → x → b or a ← x → b is independent when conditioning on x. In our example, all di are independent of f and of all other dj, j ≠ i, when conditioning on gi and ε.
(4) If variables a, b are connected as a → x ← b, then x is called a collider; conditioning on it, a and b become dependent. In our example, ε and gi are thus not independent as soon as specific data di are considered. The same holds for ε and f, unless we additionally condition on all gi.

Let us recall at this point that a frequentist would discuss the above concepts with a slightly different vocabulary.

1.1.1.2 Probability Distributions

Once independence assumptions are set, explicit assumptions about the probability distributions have to be made. We note that this is not a requirement for so-called nonparametric statistical approaches. However, we will not consider these here because most nonparametric approaches are either restricted to hypothesis testing or only justified when sample sizes are very large, while many problems in genetics have to be solved with limited data. Instead, we will focus on parametric modeling and assume that the observations D were generated from parameterized probability distributions P(D | θ) with unknown parameters θ but a known function P, which thus needs to be specified.

Example 1.2 (Allele frequency). For the model of Example 1.1 given in equation (1.1), two probability functions have to be specified: P(di | gi, ε) and P(gi | f). For the latter, we might be willing to assume that genotypes are in Hardy–Weinberg equilibrium (Hardy, 1908; Weinberg, 1908), such that

$$P(g \mid f) = \binom{2}{g} f^{g} (1 - f)^{2 - g}, \qquad g \in \{0, 1, 2\},$$

where g counts the number of derived alleles in a diploid genotype.
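As a quick sanity check of this Hardy–Weinberg prior, the short sketch below (our own illustration; the helper name is hypothetical) tabulates P(g | f) for the three diploid genotypes and confirms that the probabilities sum to one.

```python
from math import comb

def hardy_weinberg(g, f):
    # P(g | f) = C(2, g) * f**g * (1 - f)**(2 - g): the number of
    # derived alleles g in a diploid genotype is Binomial(2, f).
    return comb(2, g) * f**g * (1 - f) ** (2 - g)

f = 0.3
probs = [hardy_weinberg(g, f) for g in (0, 1, 2)]
print(probs)       # approximately [0.49, 0.42, 0.09] for f = 0.3
print(sum(probs))  # approximately 1.0 (up to floating-point rounding)
```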