Impact of multi-output and stacking methods on feed efficiency prediction from genotype using machine learning algorithms

IF 1.9 · Zone 3, Agricultural and Forestry Sciences · Q2, AGRICULTURE, DAIRY & ANIMAL SCIENCE

Journal of Animal Breeding and Genetics · Pub Date: 2023-07-05 · DOI: 10.1111/jbg.12815

Mónica Mora, Pablo González, José Ramón Quevedo, Elena Montañés, Llibertat Tusell, Rob Bergsma, Miriam Piles

Journal of Animal Breeding and Genetics, volume 140, issue 6, pages 638-652

Citations: 0

Abstract


Feeding represents the largest economic cost in meat production; therefore, selection to improve traits related to feed efficiency is a goal in most livestock breeding programs. Residual feed intake (RFI), that is, the difference between the actual feed intake and the feed intake expected from the animal's requirements, has been used as the selection criterion to improve feed efficiency since it was proposed by Koch in 1963. In growing pigs, it is computed as the residual of the multiple regression of daily feed intake (DFI) on average daily gain (ADG), backfat thickness (BFT), and metabolic body weight (MW). Recently, prediction using single-output machine learning algorithms with SNP information as predictor variables has been proposed for genomic selection in growing pigs but, as in other species, the prediction quality achieved for RFI has generally been poor. It has been suggested, however, that prediction could be improved through multi-output or stacking methods. To test this, four strategies were implemented to predict RFI. Two of them compute RFI indirectly from the predicted values of its components, obtained from (i) individual predictions (multiple single-output strategy) or (ii) simultaneous predictions (multi-output strategy). The other two predict RFI directly, using (iii) the individual predictions of its components as predictor variables jointly with the genotype (stacking strategy), or (iv) only the genotypes as predictors (single-output strategy). The single-output strategy was considered the benchmark. This research aimed to test the first three strategies against that benchmark using data recorded from 5828 growing pigs and 45,610 SNPs. For all strategies, two learning methods were fitted: random forest (RF) and support vector regression (SVR). A nested cross-validation (CV), with an outer 10-fold CV and an inner threefold CV for hyperparameter tuning, was implemented to test all strategies.
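
The abstract defines RFI as the residual of a multiple regression of DFI on ADG, BFT, and MW. The sketch below illustrates that definition only; it is not the authors' pipeline. It fits the regression by ordinary least squares through the normal equations, using only the Python standard library. The function names (`solve_linear_system`, `fit_ols`, `residual_feed_intake`) and the toy records are assumptions introduced for illustration, and the numbers are made up.

```python
from typing import List, Sequence


def solve_linear_system(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("design matrix is singular or nearly singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_ols(rows: Sequence[Sequence[float]], y: Sequence[float]) -> List[float]:
    """Return least-squares coefficients [intercept, b1, ..., bp]."""
    design = [[1.0] + list(r) for r in rows]
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)


def residual_feed_intake(
    dfi: Sequence[float],
    adg: Sequence[float],
    bft: Sequence[float],
    mw: Sequence[float],
) -> List[float]:
    """RFI = observed DFI minus the DFI expected from ADG, BFT and MW."""
    rows = list(zip(adg, bft, mw))
    beta = fit_ols(rows, dfi)
    expected = [beta[0] + sum(b * v for b, v in zip(beta[1:], r)) for r in rows]
    return [obs - exp for obs, exp in zip(dfi, expected)]


if __name__ == "__main__":
    # Hypothetical records for six pigs (illustrative values, not study data).
    adg = [0.80, 0.95, 0.90, 1.05, 0.85, 1.00]
    bft = [10.0, 12.5, 11.0, 14.0, 13.0, 10.5]
    mw = [18.0, 19.5, 20.0, 21.0, 18.5, 20.5]
    dfi = [2.1, 2.5, 2.3, 2.9, 2.4, 2.5]
    for animal, rfi in enumerate(residual_feed_intake(dfi, adg, bft, mw), start=1):
        print(f"pig {animal}: RFI = {rfi:+.3f}")
```

Under the same assumptions, the indirect strategies (i) and (ii) would substitute predicted values of DFI, ADG, BFT, and MW into this relationship, whereas the direct strategies (iii) and (iv) would treat the RFI values as the learning target. A negative RFI means the animal ate less than expected for its growth, fatness, and metabolic weight.
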
This scheme was repeated using as predictor variables subsets of increasing size (from 200 to 3000) of the most informative SNPs identified with RF. The highest prediction performance was achieved with 1000 SNPs, although the stability of feature selection was poor (0.13 out of 1). For all SNP subsets, the benchmark showed the best prediction performance. Using RF as the learner and the 1000 most informative SNPs as predictors, the means (SD) of the 10 values obtained in the test sets were 0.23 (0.04) for the Spearman correlation, 0.83 (0.04) for the zero–one loss, and 0.33 (0.03) for the rank distance loss. We conclude that information on the predicted components of RFI (DFI, ADG, MW, and BFT) does not improve the prediction of this trait relative to that obtained with the single-output strategy.
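
The results above are reported as the mean (SD) of a metric over the 10 outer test folds. As a hedged illustration of how such a summary could be computed for the Spearman correlation, the sketch below ranks observed and predicted values (ties share their average rank), takes the Pearson correlation of the ranks, and then summarises per-fold values. The function names and the per-fold numbers are hypothetical. The zero–one and rank distance losses are not sketched because the abstract does not define them.

```python
import math
import statistics
from typing import List, Sequence, Tuple


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 to n; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value ties with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(observed), average_ranks(predicted))


def summarise_folds(fold_values: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation of one metric across test folds."""
    return statistics.mean(fold_values), statistics.stdev(fold_values)


if __name__ == "__main__":
    # Hypothetical observed and predicted RFI for one small test fold.
    observed = [0.12, -0.30, 0.05, 0.41, -0.08]
    predicted = [0.02, -0.10, 0.07, 0.15, 0.01]
    print(f"fold Spearman = {spearman(observed, predicted):.2f}")

    # Hypothetical Spearman values from 10 outer folds (not the study's results).
    per_fold = [0.21, 0.27, 0.19, 0.25, 0.22, 0.30, 0.18, 0.24, 0.20, 0.26]
    mean, sd = summarise_folds(per_fold)
    print(f"mean (SD) = {mean:.2f} ({sd:.2f})")
```

A rank-based metric suits selection problems because breeders mainly need the ordering of candidates to be right, not the exact predicted values.
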


Source journal

Journal of Animal Breeding and Genetics (Agricultural and Forestry Sciences: Dairy and Animal Science)

CiteScore: 5.20

Self-citation rate: 3.80%

Articles published: not listed

Review time: 12-24 weeks

About the journal: The Journal of Animal Breeding and Genetics publishes original articles by international scientists on genomic selection and on any other topic related to breeding programmes, selection, quantitative genetics, genomics, diversity and evolution of domestic animals. Researchers, teachers, and the animal breeding industry will find the reports of interest. Book reviews appear in many issues.