The central limit theorem for the number of mutations in the genealogy of a sample from a large population

Impact factor 1.2; Region 4 (Biology); JCR Q4 (Ecology)
Yun-Xin Fu
Theoretical Population Biology, volume 162, pages 22-28. Journal article, published 2025-02-11.
DOI: 10.1016/j.tpb.2025.02.001
Full text: https://www.sciencedirect.com/science/article/pii/S0040580925000097
Citations: 0

Abstract

The number K of mutations identifiable in a sample of n sequences from a large population is one of the most important summary statistics in population genetics and is ubiquitous in the analysis of DNA sequence data. K can be expressed as the sum of n − 1 independent geometric random variables. Consequently, its probability generating function was established long ago, yielding its well-known expectation and variance. However, the statistical properties of K are much less understood than those of the number of distinct alleles in a sample. This paper demonstrates that the central limit theorem holds for K, implying that K approximately follows a normal distribution when a large sample is drawn from a population evolving according to the Wright-Fisher model with a constant effective size, or according to the constant-in-state model, which allows population sizes to vary independently, but uniformly bounded, across different states of the coalescent process. Additionally, the skewness and kurtosis of K are derived, confirming that K has asymptotically the same skewness and kurtosis as a normal distribution. Furthermore, skewness converges at speed 1/√ln(n), while kurtosis converges at speed 1/ln(n). Although the overall convergence to normality is relatively slow, the distribution of K for a modest sample size is already not too far from normal, so asymptotic normality may be sufficient for certain applications when the sample size is large enough.
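The representation of K as a sum of n − 1 independent geometric variables can be illustrated with a small simulation. The sketch below is not taken from the paper. It assumes the standard constant-size neutral coalescent with infinite-sites mutation, under which the number of mutations arising while the genealogy has j lineages is geometric on {0, 1, ...} with success probability (j − 1)/(j − 1 + θ), where θ is the scaled mutation rate (a parameter the abstract does not name). Under that assumption the mean and variance are the classical θ·a_n and θ·a_n + θ²·b_n, with a_n = Σ 1/i and b_n = Σ 1/i² over i = 1..n−1. All function names are hypothetical.

```python
import math
import random


def harmonic_sums(n):
    """Return (a_n, b_n): sums of 1/i and 1/i^2 over i = 1..n-1."""
    a = sum(1.0 / i for i in range(1, n))
    b = sum(1.0 / (i * i) for i in range(1, n))
    return a, b


def exact_mean_var(n, theta):
    """Mean and variance of K under the assumed constant-size model."""
    a, b = harmonic_sums(n)
    return theta * a, theta * a + theta * theta * b


def geometric_failures(p, rng):
    """Draw a geometric variable on {0, 1, ...} by inverse transform."""
    u = 1.0 - rng.random()  # uniform on (0, 1], avoids log(0)
    return int(math.log(u) / math.log(1.0 - p))


def simulate_K(n, theta, rng):
    """One draw of K as a sum of n-1 independent geometric variables."""
    total = 0
    for i in range(1, n):  # i = j - 1 for j = 2..n lineages
        p = i / (i + theta)  # assumed success probability, theta > 0
        total += geometric_failures(p, rng)
    return total


def sample_moments(n, theta, reps, seed=0):
    """Sample mean, variance and skewness of K over `reps` replicates."""
    rng = random.Random(seed)
    draws = [simulate_K(n, theta, rng) for _ in range(reps)]
    mean = sum(draws) / reps
    var = sum((x - mean) ** 2 for x in draws) / (reps - 1)
    sd = math.sqrt(var)
    skew = sum(((x - mean) / sd) ** 3 for x in draws) / reps
    return mean, var, skew


if __name__ == "__main__":
    theta = 5.0  # illustrative value only
    for n in (10, 100, 1000):
        em, ev = exact_mean_var(n, theta)
        mean, var, skew = sample_moments(n, theta, reps=5000, seed=1)
        print(f"n={n}: exact mean={em:.2f} var={ev:.2f}; "
              f"sample mean={mean:.2f} var={var:.2f} skew={skew:.3f}")
```

Running the script for increasing n gives a rough feel for how slowly the sample skewness approaches zero, which is the behavior the abstract describes. The values of θ, n and the replicate count are arbitrary choices for illustration.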
Source journal
Theoretical Population Biology
Subject area: Biology, Evolutionary Biology
CiteScore: 2.50
Self-citation rate: 14.30%
Articles published: 43
Review time: 6-12 weeks
About the journal: An interdisciplinary journal, Theoretical Population Biology presents articles on theoretical aspects of the biology of populations, particularly in the areas of demography, ecology, epidemiology, evolution, and genetics. Emphasis is on the development of mathematical theory and models that enhance the understanding of biological phenomena. Articles highlight the motivation and significance of the work for advancing progress in biology, relying on a substantial mathematical effort to obtain biological insight. The journal also presents empirical results and computational and statistical methods directly impinging on theoretical problems in population biology.