Benchmarking Parametric and Machine Learning Models for Genomic Prediction of Complex Traits.

ILIRIA International Review Pub Date : 2019-11-05 DOI:10.1534/g3.119.400498

Christina B Azodi, Emily Bolger, Andrew McCarren, Mark Roantree, Gustavo de Los Campos, Shin-Han Shiu

Pages: 3691-3702. Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6829122/pdf/


Abstract

The usefulness of genomic prediction in crop and livestock breeding programs has prompted efforts to develop new and improved genomic prediction algorithms, such as artificial neural networks and gradient tree boosting. However, the performance of these algorithms has not been compared systematically across a wide range of datasets and models. Using data on 18 traits across six plant species with different marker densities and training population sizes, we compared the performance of six linear and six non-linear algorithms. First, we found that hyperparameter selection was necessary for all non-linear algorithms and that feature selection prior to model training was critical for artificial neural networks when the markers greatly outnumbered the training lines. Across all species and trait combinations, no single algorithm performed best; however, predictions based on a combination of results from multiple algorithms (i.e., ensemble predictions) performed consistently well. While linear and non-linear algorithms performed best for a similar number of traits, the performance of non-linear algorithms varied more between traits. Although artificial neural networks did not perform best for any trait, we identified strategies (i.e., feature selection, seeded starting weights) that boosted their performance to near the level of other algorithms. Our results highlight the importance of algorithm selection for the prediction of trait values.
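Two ideas from the abstract, ensemble prediction and feature selection before training, can be illustrated with a minimal sketch. The abstract does not say how predictions were combined, how markers were selected, or how accuracy was scored. The unweighted mean, the correlation-based marker filter, and Pearson's r are therefore assumptions here, as are all function names and the toy values, which are contrived so that the two models' errors cancel.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant vector carries no ranking information
    return cov / sqrt(var_x * var_y)


def ensemble_predict(predictions_by_model):
    """Assumed combination rule: unweighted mean of each model's prediction per line."""
    models = list(predictions_by_model.values())
    return [sum(line_preds) / len(models) for line_preds in zip(*models)]


def select_markers(genotypes, trait, k):
    """Assumed filter: indices of the k markers with the largest |r| to the trait.

    Intended to be run on training lines only, so that the held-out lines
    do not influence which markers are kept.
    """
    n_markers = len(genotypes[0])
    scores = []
    for j in range(n_markers):
        column = [row[j] for row in genotypes]
        scores.append((abs(pearson_r(column, trait)), j))
    scores.sort(key=lambda s: (-s[0], s[1]))  # best score first, ties by index
    return sorted(j for _, j in scores[:k])


# Hypothetical toy values, for illustration only (not data from the study).
observed = [1.0, 2.0, 3.0, 4.0]
predictions = {
    "linear_model": [1.5, 1.5, 3.5, 3.5],
    "tree_model": [0.5, 2.5, 2.5, 4.5],
}
ensemble = ensemble_predict(predictions)

# Rows are training lines, columns are markers coded 0/1/2.
genotypes = [
    [0, 2, 1],
    [1, 2, 0],
    [1, 0, 1],
    [2, 0, 0],
]
kept = select_markers(genotypes, observed, k=2)

for name, preds in predictions.items():
    print(name, round(pearson_r(preds, observed), 3))
print("ensemble", round(pearson_r(ensemble, observed), 3))
print("kept markers", kept)
```

In practice the ensemble members would be fitted models evaluated on held-out lines, and the number of markers kept (`k` here) would itself be a hyperparameter to tune.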



