Investigating the Performance of Frequentist and Bayesian Techniques in Genomic Evaluation.

IF 2.1 | Zone 4 (Biology) | JCR Q4: BIOCHEMISTRY & MOLECULAR BIOLOGY

Biochemical Genetics, Pub Date: 2024-07-01, DOI: 10.1007/s10528-024-10842-1

Hamid Sahebalam, Mohsen Gholizadeh, Hasan Hafezian

Citations: 0

Abstract

Genomic evaluation relies on the assumption of linkage disequilibrium between dense genome-wide single-nucleotide polymorphism (SNP) markers and quantitative trait loci (QTL). This study evaluated four frequentist methods (Ridge Regression, Least Absolute Shrinkage and Selection Operator (LASSO), Elastic Net, and Genomic Best Linear Unbiased Prediction (GBLUP)) and five Bayesian methods (Bayes Ridge Regression (BRR), Bayes A, Bayesian LASSO, Bayes C, and Bayes B) for genomic selection using simulated data. Differences in prediction accuracy were assessed pairwise in terms of statistical significance (p-values from the t test and the Mann-Whitney U test) and practical significance (Cohen's d effect size). For this purpose, data were simulated under two scenarios differing in marker density (4000 and 8000 SNPs across the whole genome). The simulated genome comprised four chromosomes of 1 Morgan each, carrying 100 randomly distributed QTL and evenly distributed SNPs at two densities (1000 and 2000, presumably per chromosome, which matches the genome-wide totals), with heritability set to 0.4. For the frequentist methods other than GBLUP, the regularization parameter λ was chosen by five-fold cross-validation. In both scenarios, Ridge Regression and GBLUP achieved the highest prediction accuracy among the frequentist methods. The lowest and the highest bias were shown by Ridge Regression and GBLUP, respectively. Among the Bayesian methods, Bayes B and BRR showed the highest and the lowest prediction accuracy, respectively. Bayesian LASSO had the lowest bias in both scenarios, while the highest bias was shown by BRR in the first scenario and by Bayes B in the second. Across all methods and both scenarios, Bayes B was the most accurate, and LASSO and Elastic Net were the least accurate.
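
The five-fold cross-validation step for λ mentioned above can be illustrated with a minimal sketch. It is not the authors' simulation or software: the toy genotype model, the sample sizes, the λ grid, and the function names `simulate`, `ridge_fit`, and `cv_lambda` are all assumptions made for illustration, and only the heritability of 0.4 and the five folds come from the abstract.

```python
import numpy as np

def simulate(n=300, p=400, n_qtl=20, h2=0.4, seed=0):
    """Toy SNP genotypes (0/1/2, column-centered) and phenotypes.

    All sizes are illustrative assumptions, far smaller than the study's.
    """
    rng = np.random.default_rng(seed)
    X = rng.binomial(2, 0.5, size=(n, p)).astype(float)
    X -= X.mean(axis=0)
    beta = np.zeros(p)
    qtl = rng.choice(p, size=n_qtl, replace=False)
    beta[qtl] = rng.normal(size=n_qtl)
    g = X @ beta  # true genetic values
    # Residual SD chosen so that var(g) / (var(g) + var(e)) is about h2.
    e = rng.normal(scale=np.sqrt(g.var() * (1 - h2) / h2), size=n)
    return X, g + e, g

def ridge_fit(X, y, lam):
    """Closed-form ridge solution: (X'X + lam * I)^-1 X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def cv_lambda(X, y, lambdas, k=5, seed=0):
    """Return the lambda with the lowest mean k-fold validation MSE."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    mse = np.zeros(len(lambdas))
    for i, lam in enumerate(lambdas):
        for fold in folds:
            train = np.setdiff1d(np.arange(len(y)), fold)
            mu = y[train].mean()  # intercept estimated on training folds only
            b = ridge_fit(X[train], y[train] - mu, lam)
            pred = mu + X[fold] @ b
            mse[i] += np.mean((y[fold] - pred) ** 2) / k
    return lambdas[int(np.argmin(mse))], mse

X, y, g = simulate()
grid = np.logspace(-1, 4, 12)  # hypothetical search grid for lambda
best_lam, cv_mse = cv_lambda(X, y, grid)
```

Here `best_lam` is the grid value with the smallest average validation error. A coarse logarithmic grid is a common starting point, and the same loop could be reused for LASSO or Elastic Net by swapping `ridge_fit` for the corresponding solver.
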
As expected, the greatest similarity in performance was observed between GBLUP and BRR ($d = 0.007$ in the first scenario and $d = 0.003$ in the second). The parametric t test and the non-parametric Mann-Whitney U test gave similar results. Of the 36 pairwise t tests between methods in each scenario, 14 ($P < .001$) were significant in the first scenario and 2 ($P < .05$) in the second, which indicates that differences in performance among methods shrink as the number of predictors increases. Cohen's d effect sizes supported this: as model complexity increased, very large effect sizes were no longer observed. The regularization parameters of frequentist methods should be optimized by cross-validation before these methods are used in genomic evaluation.
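
The pairwise comparison described above (nine methods give 9 × 8 / 2 = 36 pairs per scenario) can be sketched as follows. This is an illustration under stated assumptions, not the authors' analysis: the accuracies are synthetic random numbers rather than results from the paper, the choice of 10 replicates per method is hypothetical, an independent-samples t test is assumed, and Cohen's d is computed with the usual pooled standard deviation.

```python
from itertools import combinations

import numpy as np
from scipy import stats

METHODS = ["Ridge", "LASSO", "ElasticNet", "GBLUP", "BRR",
           "BayesA", "BayesianLASSO", "BayesC", "BayesB"]

def cohens_d(a, b):
    """Cohen's d using the pooled standard deviation of two samples."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    na, nb = len(a), len(b)
    pooled = np.sqrt(((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1))
                     / (na + nb - 2))
    return (a.mean() - b.mean()) / pooled

def compare_all(acc):
    """Run the t test, Mann-Whitney U test, and Cohen's d on every pair."""
    rows = []
    for m1, m2 in combinations(acc, 2):
        a, b = acc[m1], acc[m2]
        t_p = stats.ttest_ind(a, b).pvalue
        u_p = stats.mannwhitneyu(a, b, alternative="two-sided").pvalue
        rows.append((m1, m2, t_p, u_p, cohens_d(a, b)))
    return rows

# Synthetic placeholder accuracies (NOT the paper's data): 10 hypothetical
# replicates per method, all drawn from the same distribution.
rng = np.random.default_rng(0)
acc = {m: rng.normal(loc=0.6, scale=0.03, size=10) for m in METHODS}
rows = compare_all(acc)
```

Each row holds the two method names, the two p-values, and the effect size. Reporting d alongside p separates practical from statistical significance, which is why a pair such as GBLUP and BRR can be summarized by a near-zero d regardless of sample size.
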

Source journal: Biochemical Genetics (Biology: Biochemistry & Molecular Biology)

CiteScore: 3.90

Self-citation rate: 0.00%

Articles published: 133

Review time: 4.8 months

Journal description: Biochemical Genetics welcomes original manuscripts that address and test clear scientific hypotheses, are directed to a broad scientific audience, and clearly contribute to the advancement of the field through the use of sound sampling or experimental design, reliable analytical methodologies and robust statistical analyses. Although studies focusing on particular regions and target organisms are welcome, it is not the journal’s goal to publish essentially descriptive studies that provide results with narrow applicability, or are based on very small samples or pseudoreplication. Rather, Biochemical Genetics welcomes review articles that go beyond summarizing previous publications and create added value through the systematic analysis and critique of the current state of knowledge or by conducting meta-analyses. Methodological articles are also within the scope of Biochemical Genetics, particularly when new laboratory techniques or computational approaches are fully described and thoroughly compared with the existing benchmark methods. Biochemical Genetics welcomes articles on the following topics: Genomics; Proteomics; Population genetics; Phylogenetics; Metagenomics; Microbial genetics; Genetics and evolution of wild and cultivated plants; Animal genetics and evolution; Human genetics and evolution; Genetic disorders; Genetic markers of diseases; Gene technology and therapy; Experimental and analytical methods; Statistical and computational methods.