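The effect of different approaches to determining the regularization parameter of Bayesian LASSO on the accuracy of genomic prediction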

IF 2.7; Region 4 (Biology); JCR Q3 (BIOCHEMISTRY & MOLECULAR BIOLOGY)

Mammalian Genome; Pub Date: 2025-03-01; Epub Date: 2024-12-11; DOI: 10.1007/s00335-024-10088-7

Hamid Sahebalam, Mohsen Gholizadeh, Seyed Hassan Hafezian


Citations: 0

Abstract


Using dense genomic markers opens up new opportunities and challenges for breeding programs. The need to penalize marker-specific regression coefficients becomes particularly important when dense markers are available. Therefore, fitting the marker effects to observations using a regularization technique, such as Bayesian LASSO (BL) regression, is of great interest. When the Laplace prior distribution is applied to the regression coefficients, BL can be interpreted as a Bayesian form of $L_1$-norm regularization. A critical issue is the appropriate selection of hyperparameter values in the prior distributions of regularization techniques, as these values essentially control the sparsity of the estimated model. The purpose of this study was to evaluate different approaches for selecting the regularization parameter in BL: fully Bayesian approaches, such as a gamma prior (BL_Gamma), a beta prior (BL_Beta) and a fixed prior (BL_Fixed), as well as data-driven approaches such as cross-validation based on mean square error (BL_CV_MSE) and on prediction accuracy (BL_CV_PA). Additionally, information-criteria-based methods, including Akaike's information criterion (BL_AIC), the Bayesian information criterion (BL_BIC) and the deviance information criterion (BL_DIC), were explored. For this purpose, a genome containing eight chromosomes (each 1 Morgan in length) with 100 randomly distributed quantitative trait loci was simulated. The studied scenarios were as follows: scenario 1 involved 4000 markers and a heritability of 0.2; scenario 2 involved 4000 markers and a heritability of 0.6; scenario 3 involved 16,000 markers and a heritability of 0.2; and scenario 4 involved 16,000 markers and a heritability of 0.6. The results showed that among the fully Bayesian and cross-validation approaches, BL_Gamma, BL_Beta, and BL_CV_MSE provided the highest prediction accuracy (PA) in scenarios 1 and 3. With increased marker density and heritability (scenario 4), the cross-validation approaches performed slightly better. The information-criteria-based methods demonstrated the lowest PA. Increasing heritability led to a decrease in the model penalty on the regression coefficients, whereas increasing marker density led to an increase. The PA obtained in the target population ranged from 0.210 to 0.413 in scenario 1, 0.402 to 0.600 in scenario 2, 0.256 to 0.442 in scenario 3, and 0.478 to 0.653 in scenario 4. In general, fully Bayesian approaches based on random priors for the regularization parameter are recommended for BL, as they provide acceptable PA with lower computational loads.
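To make the two cross-validation criteria concrete, the following is a minimal sketch rather than the authors' procedure. It assumes the ordinary LASSO point estimate (the posterior mode under independent Laplace priors, fitted here by coordinate descent) as a stand-in for full Bayesian LASSO sampling, and it uses a toy simulated data set far smaller than the scenarios in the abstract. All function names, the candidate grid `lambdas`, and the simulation settings are hypothetical.

```python
import math
import random

def soft_threshold(z, t):
    """Closed-form solution of the one-dimensional lasso problem."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso_cd(X, y, lam, n_iter=50):
    """Coordinate descent for (1/2n)*||y - Xb||^2 + lam*||b||_1."""
    n, p = len(X), len(X[0])
    b = [0.0] * p
    resid = list(y)  # residual y - Xb; equals y while b is all zeros
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # covariance of marker j with the residual that excludes marker j
            rho = sum(X[i][j] * (resid[i] + X[i][j] * b[j]) for i in range(n)) / n
            delta = soft_threshold(rho, lam) / col_sq[j] - b[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                b[j] += delta
    return b

def predict(X, b):
    return [sum(x * c for x, c in zip(row, b)) for row in X]

def mse(y, yhat):
    return sum((a - c) ** 2 for a, c in zip(y, yhat)) / len(y)

def pearson(y, yhat):
    """Correlation between observed and predicted values (used here as PA)."""
    n = len(y)
    my, mh = sum(y) / n, sum(yhat) / n
    sxy = sum((a - my) * (c - mh) for a, c in zip(y, yhat))
    sxx = sum((a - my) ** 2 for a in y)
    syy = sum((c - mh) ** 2 for c in yhat)
    if sxx == 0.0 or syy == 0.0:
        return 0.0  # constant predictions, e.g. every effect shrunk to zero
    return sxy / math.sqrt(sxx * syy)

def cv_select(X, y, lambdas, k=5):
    """Return the lambda with the lowest CV MSE and the one with the highest CV PA."""
    n = len(y)
    folds = [list(range(f, n, k)) for f in range(k)]  # interleaved folds
    scores = {}
    for lam in lambdas:
        fold_mse, fold_pa = [], []
        for test_idx in folds:
            held = set(test_idx)
            train_idx = [i for i in range(n) if i not in held]
            b = lasso_cd([X[i] for i in train_idx], [y[i] for i in train_idx], lam)
            y_test = [y[i] for i in test_idx]
            y_hat = predict([X[i] for i in test_idx], b)
            fold_mse.append(mse(y_test, y_hat))
            fold_pa.append(pearson(y_test, y_hat))
        scores[lam] = (sum(fold_mse) / k, sum(fold_pa) / k)
    best_mse = min(lambdas, key=lambda l: scores[l][0])
    best_pa = max(lambdas, key=lambda l: scores[l][1])
    return best_mse, best_pa, scores

def simulate(n=60, p=20, n_qtl=3, h2=0.6, seed=1):
    """Toy data: markers coded -1/0/1, a few causal loci, noise scaled to h2."""
    rng = random.Random(seed)
    X = [[rng.choice([0, 1, 2]) - 1.0 for _ in range(p)] for _ in range(n)]
    beta = [0.0] * p
    for j in rng.sample(range(p), n_qtl):
        beta[j] = rng.gauss(0.0, 1.0)
    g = predict(X, beta)
    mg = sum(g) / n
    var_g = sum((v - mg) ** 2 for v in g) / n
    sd_e = math.sqrt(var_g * (1.0 - h2) / h2)
    y = [gi + rng.gauss(0.0, sd_e) for gi in g]
    my = sum(y) / n
    return X, [v - my for v in y], beta

X, y, beta = simulate()
lambdas = [0.01, 0.05, 0.1, 0.3, 1.0]  # hypothetical candidate grid
best_mse, best_pa, scores = cv_select(X, y, lambdas)
print("lambda chosen by CV MSE:", best_mse)
print("lambda chosen by CV PA :", best_pa)
```

The two criteria need not agree: MSE penalizes bias from over-shrinkage, whereas the correlation used as PA is insensitive to the scale of the predictions. A fully Bayesian treatment would instead place a prior on the regularization parameter and sample it alongside the marker effects, which avoids refitting the model for every candidate value.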


Source journal

Mammalian Genome (Biology: Biochemistry & Molecular Biology)

CiteScore: 4.00

Self-citation rate: 0.00%

Articles published: not listed

Review time: 6-12 weeks

About the journal: Mammalian Genome focuses on the experimental, theoretical and technical aspects of genetics, genomics, epigenetics and systems biology in mouse, human and other mammalian species, with an emphasis on the relationship between genotype and phenotype, elucidation of biological and disease pathways as well as experimental aspects of interventions, therapeutics, and precision medicine. The journal aims to publish high-quality original papers that present novel findings in all areas of mammalian genetic research as well as review articles on areas of topical interest. The journal will also feature commentaries and editorials to inform readers of breakthrough discoveries as well as issues of research standards, policies and ethics.