Potential Benefits and Challenges of Quantifying Pseudoreplication in Genomic Data with Entropy Statistics
Authors: Eric J Ward, Robin S Waples
Journal: Entropy, volume 26, issue 9 (published 2024-09-21)
DOI: 10.3390/e26090805 (https://doi.org/10.3390/e26090805)
Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11431677/pdf/
Citations: 0
Abstract
Generating vast arrays of genetic markers for evolutionary ecology studies has become routine and cost-effective. However, analyzing data from large numbers of loci located on a small, finite number of chromosomes introduces a challenge: loci on the same chromosome do not assort independently, leading to pseudoreplication. Previous studies have demonstrated that pseudoreplication can substantially reduce the precision of genetic analyses (and widen confidence intervals), including analyses based on F_ST and on linkage disequilibrium (LD) between pairs of loci. In LD analyses, another type of dependency (overlapping pairs of the same loci) also creates pseudoreplication. Building on previous work, we explore the potential of entropy metrics, particularly total correlation (TC), to improve on the status quo in assessing pseudoreplication in LD studies. Our simulations, performed on a monoecious population across a range of effective population sizes (N_e) and numbers of loci, attempted to isolate the overlapping-pairs-of-loci effect by considering unlinked loci and using entropy to quantify inter-locus relationships. We hypothesized a positive correlation between TC and the number of loci (L), and a negative correlation between TC and N_e. Results from our statistical models predicting TC demonstrate a strong effect of the number of loci and muted effects of N_e and other predictors, adding support to the use of entropy-based metrics as a tool for estimating the statistical information of complex genetic datasets. Our results also highlight a scalability challenge: computational limitations arise as the number of loci grows, which currently limits our approach to smaller datasets. Despite these challenges, this work further refines our understanding of entropy measures and offers insights into the complex dynamics of genetic information in evolutionary ecology research.
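For readers unfamiliar with total correlation, the following is a minimal sketch of how TC can be estimated from discrete genotype data, using the standard definition TC = sum of the per-locus entropies minus the joint entropy across all loci. It is not the authors' implementation: the function names (`entropy`, `total_correlation`), the genotype coding (0, 1, 2) and the simple plug-in entropy estimator are all assumptions made for illustration.

```python
import math
from collections import Counter


def entropy(samples):
    """Plug-in Shannon entropy (in bits) of a sequence of hashable outcomes."""
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def total_correlation(genotypes):
    """Estimate total correlation (bits) across loci.

    genotypes: list of individuals, each a tuple of per-locus genotype
    codes (hypothetical coding, e.g. 0, 1, 2 copies of an allele).
    """
    n_loci = len(genotypes[0])
    # Sum of marginal entropies, one term per locus.
    marginal = sum(entropy([ind[j] for ind in genotypes]) for j in range(n_loci))
    # Joint entropy treats each individual's multilocus genotype as one outcome.
    joint = entropy([tuple(ind) for ind in genotypes])
    return marginal - joint


# Two perfectly dependent loci share all their information.
dependent = [(0, 0), (1, 1), (0, 0), (1, 1)]
# Two loci whose genotypes occur in every combination equally often.
independent = [(0, 0), (0, 1), (1, 0), (1, 1)]

print(total_correlation(dependent))
print(total_correlation(independent))
```

Under this sketch the joint entropy is computed over whole multilocus genotypes, and with three genotype classes per biallelic locus the number of possible joint states is 3^L. That growth is one way to see why a joint-entropy calculation becomes harder as the number of loci increases, although the abstract does not specify the exact source of the computational limits it reports.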
About the journal:
Entropy (ISSN 1099-4300), an international and interdisciplinary journal of entropy and information studies, publishes reviews, regular research papers and short notes. Our aim is to encourage scientists to publish their theoretical and experimental details as fully as possible. There is no restriction on the length of papers. If a paper involves computation or experiments, the details must be provided so that the results can be reproduced.