GPS: Harnessing data fusion strategies to improve the accuracy of machine learning-based genomic and phenotypic selection.

IF 11.6 | Zone 1, Biology | Q1, Biochemistry & Molecular Biology
Plant Communications Pub Date : 2025-08-11 Epub Date: 2025-06-11 DOI:10.1016/j.xplc.2025.101416
Hongshan Wu, Shichao Jin, Chao Xiang, Jianling Tang, Junhong Xian, Jiaoping Zhang, Jinming Zhao, Xianzhong Feng, Dong Jiang, Yufeng Wu, Yanfeng Ding

Abstract

Genomic selection (GS) and phenotypic selection (PS) are widely used for accelerating plant breeding. However, the accuracy, robustness, and transferability of these two selection methods are underexplored, especially when addressing complex traits. In this study, we introduce a novel data fusion framework, GPS (genomic and phenotypic selection), designed to enhance predictive performance by integrating genomic and phenotypic data through three distinct fusion strategies: data fusion, feature fusion, and result fusion. The GPS framework was rigorously tested using an extensive suite of models, including statistical approaches (GBLUP and BayesB), machine learning models (Lasso, RF, SVM, XGBoost, and LightGBM), a deep learning method (DNNGP), and a recent phenotype-assisted prediction model (MAK). These models were applied to large datasets from four crop species (maize, soybean, rice, and wheat), demonstrating the versatility and robustness of the framework. Our results indicated that: (1) data fusion achieved the highest accuracy compared with the feature fusion and result fusion strategies. The top-performing data fusion model (Lasso_D) improved the selection accuracy by 53.4% compared to the best GS model (LightGBM) and by 18.7% compared to the best PS model (Lasso). (2) Lasso_D exhibited exceptional robustness, achieving high predictive accuracy even with a sample size as small as 200 and demonstrating resilience to single-nucleotide polymorphism (SNP) density variations, underscoring its adaptability to diverse data conditions. Moreover, the model's accuracy improved with the number of auxiliary traits and their correlation strength with target traits, further highlighting its adaptability to complex trait prediction. (3) Lasso_D demonstrated broad transferability, with substantial improvements in predictive accuracy when incorporating multi-environmental data. This enhancement resulted in only a 0.3% reduction in accuracy compared to predictions generated using data from the same environment, affirming the model's reliability in cross-environmental scenarios. This study provides groundbreaking insights, pushing the boundaries of predictive accuracy, robustness, and transferability in trait prediction. These findings represent a significant contribution to plant science, plant breeding, and the broader interdisciplinary fields of statistics and artificial intelligence.
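
The abstract names three fusion strategies without defining them. The sketch below illustrates one common reading of each, and every detail in it is an assumption rather than the authors' implementation: "data fusion" is taken to mean concatenating SNP genotypes with auxiliary phenotypes before fitting one model, "feature fusion" to mean reducing the genotypes to a few principal components and then concatenating, and "result fusion" to mean averaging the predictions of a genomic-only and a phenotype-only model. A closed-form ridge regression stands in for Lasso so the example needs only NumPy, the data are synthetic, and accuracy is assumed to be the Pearson correlation between observed and predicted values.

```python
import numpy as np

def make_toy_data(n=200, n_snps=500, n_aux=3, seed=0):
    """Synthetic genotypes G, auxiliary phenotypes P, and target trait y (illustrative only)."""
    rng = np.random.default_rng(seed)
    G = rng.integers(0, 3, size=(n, n_snps)).astype(float)  # SNPs coded 0/1/2
    effects = np.zeros(n_snps)
    effects[:20] = rng.normal(size=20)                       # 20 causal SNPs
    genetic = G @ effects
    genetic = (genetic - genetic.mean()) / genetic.std()
    y = genetic + rng.normal(scale=1.0, size=n)              # target trait
    # Auxiliary traits are simulated as noisy correlates of the target.
    P = 0.7 * y[:, None] + rng.normal(scale=0.7, size=(n, n_aux))
    return G, P, y

def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression (stand-in for Lasso); returns (weights, intercept)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    return w, y_mean - x_mean @ w

def predict(model, X):
    w, b = model
    return X @ w + b

def data_fusion(G_tr, P_tr, y_tr, G_te, P_te):
    """Assumed reading: concatenate raw inputs, fit a single model."""
    model = fit_ridge(np.hstack([G_tr, P_tr]), y_tr)
    return predict(model, np.hstack([G_te, P_te]))

def feature_fusion(G_tr, P_tr, y_tr, G_te, P_te, n_components=10):
    """Assumed reading: compress genotypes to principal components, then concatenate."""
    mu = G_tr.mean(axis=0)
    _, _, Vt = np.linalg.svd(G_tr - mu, full_matrices=False)
    V = Vt[:n_components].T
    Z_tr = np.hstack([(G_tr - mu) @ V, P_tr])
    Z_te = np.hstack([(G_te - mu) @ V, P_te])
    return predict(fit_ridge(Z_tr, y_tr), Z_te)

def result_fusion(G_tr, P_tr, y_tr, G_te, P_te, weight=0.5):
    """Assumed reading: weighted average of genomic-only and phenotype-only predictions."""
    pred_g = predict(fit_ridge(G_tr, y_tr), G_te)
    pred_p = predict(fit_ridge(P_tr, y_tr), P_te)
    return weight * pred_g + (1.0 - weight) * pred_p

def pearson(y_true, y_pred):
    """Pearson correlation, assumed here as the accuracy metric."""
    return float(np.corrcoef(y_true, y_pred)[0, 1])

if __name__ == "__main__":
    G, P, y = make_toy_data()
    tr, te = slice(0, 150), slice(150, 200)  # simple hold-out split
    for name, fuse in [("data", data_fusion),
                       ("feature", feature_fusion),
                       ("result", result_fusion)]:
        pred = fuse(G[tr], P[tr], y[tr], G[te], P[te])
        print(f"{name} fusion: r = {pearson(y[te], pred):.3f}")
```

Because the toy data and the ridge stand-in are arbitrary, the printed correlations say nothing about the rankings reported in the abstract; the sketch only shows where in the pipeline each strategy combines the two data sources.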

Source journal
Plant Communications
Subject area: Agricultural and Biological Sciences, Plant Science
CiteScore: 15.70
Self-citation rate: 5.70%
Articles published: 105
Review time: 6 weeks
About the journal: Plant Communications is an open access publishing platform that supports the global plant science community. It publishes original research, review articles, technical advances, and research resources in various areas of plant sciences. The scope of topics includes evolution, ecology, physiology, biochemistry, development, reproduction, metabolism, molecular and cellular biology, genetics, genomics, environmental interactions, biotechnology, breeding of higher and lower plants, and their interactions with other organisms. The goal of Plant Communications is to provide a high-quality platform for the dissemination of plant science research.