‘Highly-Informative’ Genetic Markers Can Bias Conclusions: Examples and General Solutions

IF 5.5 · Region 1 (Biology) · Q1 BIOCHEMISTRY & MOLECULAR BIOLOGY

Molecular Ecology Resources Pub Date : 2025-07-11 DOI:10.1111/1755-0998.70011

Andy Lee, William Hemstrom, Natalie Molea, Gordon Luikart, Mark R. Christie

Molecular Ecology Resources, volume 25, issue 7. Full text: https://onlinelibrary.wiley.com/doi/10.1111/1755-0998.70011

Citations: 0

Abstract



High-grading bias is the overestimation of power in a subset of loci caused by model overfitting. Using both empirical and simulated datasets, we show that high-grading bias can cause severe overestimation of population structure, and thus mislead investigators, whenever highly informative or high-F_ST markers are chosen (i.e., ascertained) and used for subsequent assessments, a common practice in population genetic studies. This problem can occur in panmictic populations with no local adaptation. Biased results from choosing high-F_ST markers may have severe downstream implications for management and conservation, such as erroneous conservation unit delineation, which could squander limited conservation resources to protect incorrectly defined ‘populations’. Furthermore, we caution that high-grading is not limited to F_ST approaches; high-grading bias is a concern whenever a small subset of markers is first chosen to explain differences among groups based on their degree of difference and is subsequently reused to estimate the degree of difference among those groups. For example, selecting high-F_ST loci for use in a GT-seq panel or using differentially expressed genes to plot sample membership in multivariate space can both result in spurious structure when none exists. We illustrate that using statistically based outlier tests in place of arbitrary F_ST cut-offs can reduce bias. Alternatively, permutation tests or cross-evaluation can be used to detect high-grading bias. We provide an R package, PCAssess, to help researchers detect and prevent high-grading bias in genetic datasets by automating permutation tests and principal component analyses (https://github.com/hemstrow/PCAssess).
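The mechanism described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' code and does not use PCAssess. It is a minimal standard-library Python example that assumes a simple, uncorrected Nei-style F_ST estimator, and every function name and parameter value in it is hypothetical. It draws all individuals from a single panmictic population, splits them into two arbitrary groups, and compares three things: F_ST over all loci, F_ST over the loci selected for having the highest F_ST (high-graded), and F_ST for loci selected on one half of the samples but evaluated on the other half (cross-evaluation). It also runs a label-permutation test that repeats the selection step inside every permutation.

```python
import random

def simulate_panmictic(n_ind, n_loci, rng):
    """Genotypes (0/1/2 allele counts) for individuals drawn from ONE population."""
    freqs = [rng.uniform(0.1, 0.9) for _ in range(n_loci)]
    return [[(rng.random() < p) + (rng.random() < p) for p in freqs]
            for _ in range(n_ind)]

def locus_components(geno, group_a, group_b, locus):
    """Return (HT - HS, HT) for one locus (uncorrected Nei-style F_ST, an assumption)."""
    pa = sum(geno[i][locus] for i in group_a) / (2 * len(group_a))
    pb = sum(geno[i][locus] for i in group_b) / (2 * len(group_b))
    hs = (2 * pa * (1 - pa) + 2 * pb * (1 - pb)) / 2  # mean within-group heterozygosity
    pbar = (pa + pb) / 2
    ht = 2 * pbar * (1 - pbar)                        # total heterozygosity
    return ht - hs, ht

def multilocus_fst(geno, group_a, group_b, loci):
    """Ratio-of-sums F_ST across the given loci."""
    comps = [locus_components(geno, group_a, group_b, l) for l in loci]
    num = sum(c[0] for c in comps)
    den = sum(c[1] for c in comps)
    return num / den if den > 0 else 0.0

def top_fst_loci(geno, group_a, group_b, n_top):
    """Indices of the n_top loci with the highest single-locus F_ST."""
    def per_locus(l):
        num, den = locus_components(geno, group_a, group_b, l)
        return num / den if den > 0 else 0.0
    return sorted(range(len(geno[0])), key=per_locus, reverse=True)[:n_top]

rng = random.Random(1)
n_per_group, n_loci, n_top, n_perm = 20, 500, 10, 100
geno = simulate_panmictic(2 * n_per_group, n_loci, rng)
group_a = list(range(n_per_group))                   # arbitrary labels: no real structure
group_b = list(range(n_per_group, 2 * n_per_group))

# Baseline: every locus, no selection.
fst_all = multilocus_fst(geno, group_a, group_b, range(n_loci))

# High-grading: pick the highest-F_ST loci, then re-estimate F_ST on the same samples.
fst_top = multilocus_fst(geno, group_a, group_b,
                         top_fst_loci(geno, group_a, group_b, n_top))

# Cross-evaluation: select loci on one half of each group, estimate on the other half.
half = n_per_group // 2
top_disc = top_fst_loci(geno, group_a[:half], group_b[:half], n_top)
fst_cross = multilocus_fst(geno, group_a[half:], group_b[half:], top_disc)

# Permutation test: shuffle labels and redo the selection step each time.
null = []
for _ in range(n_perm):
    ids = group_a + group_b
    rng.shuffle(ids)
    perm_a, perm_b = ids[:n_per_group], ids[n_per_group:]
    null.append(multilocus_fst(geno, perm_a, perm_b,
                               top_fst_loci(geno, perm_a, perm_b, n_top)))
p_value = (1 + sum(f >= fst_top for f in null)) / (n_perm + 1)

print(f"all loci: {fst_all:.4f}  high-graded: {fst_top:.4f}  "
      f"cross-evaluated: {fst_cross:.4f}  permutation p: {p_value:.3f}")
```

In this setup the high-graded estimate exceeds the all-loci estimate by construction, even though no structure was simulated, while the cross-evaluated estimate falls back toward the baseline. Because the uncorrected estimator is biased upward at small sample sizes, the absolute values matter less than the contrast between the three numbers. Repeating the selection inside each permutation is the key design choice: permuting labels only after the loci have been chosen would leave the bias in the test itself.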


Source journal

Molecular Ecology Resources (Biology: Evolutionary Biology)

CiteScore: 15.60
Self-citation rate: 5.20%
Articles published: 170
Review time: 3 months

Journal description: Molecular Ecology Resources promotes the creation of comprehensive resources for the scientific community, encompassing computer programs, statistical and molecular advancements, and a diverse array of molecular tools. Serving as a conduit for disseminating these resources, the journal targets a broad audience of researchers in the fields of evolution, ecology, and conservation. Articles in Molecular Ecology Resources are crafted to support investigations tackling significant questions within these disciplines. In addition to original resource articles, Molecular Ecology Resources features Reviews, Opinions, and Comments relevant to the field. The journal also periodically releases Special Issues focusing on resource development within specific areas.