Authors: Hongshan Wu, Shichao Jin, Chao Xiang, Jianling Tang, Junhong Xian, Jiaoping Zhang, Jinming Zhao, Xianzhong Feng, Dong Jiang, Yufeng Wu, Yanfeng Ding
Journal: Plant Communications, article 101416 (impact factor 11.6; JCR Q1, Biochemistry & Molecular Biology)
Published: 2025-08-11 (Epub 2025-06-11)
DOI: https://doi.org/10.1016/j.xplc.2025.101416
Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12365829/pdf/
GPS: Harnessing data fusion strategies to improve the accuracy of machine learning-based genomic and phenotypic selection.
Genomic selection (GS) and phenotypic selection (PS) are widely used for accelerating plant breeding. However, the accuracy, robustness, and transferability of these two selection methods are underexplored, especially when addressing complex traits. In this study, we introduce a novel data fusion framework, GPS (genomic and phenotypic selection), designed to enhance predictive performance by integrating genomic and phenotypic data through three distinct fusion strategies: data fusion, feature fusion, and result fusion. The GPS framework was rigorously tested using an extensive suite of models, including statistical approaches (GBLUP and BayesB), machine learning models (Lasso, RF, SVM, XGBoost, and LightGBM), a deep learning method (DNNGP), and a recent phenotype-assisted prediction model (MAK). These models were applied to large datasets from four crop species, maize, soybean, rice, and wheat, demonstrating the versatility and robustness of the framework. Our results indicated that: (1) data fusion achieved the highest accuracy compared with the feature fusion and result fusion strategies. The top-performing data fusion model (Lasso_D) improved the selection accuracy by 53.4% compared to the best GS model (LightGBM) and by 18.7% compared to the best PS model (Lasso). (2) Lasso_D exhibited exceptional robustness, achieving high predictive accuracy even with a sample size as small as 200 and demonstrating resilience to single-nucleotide polymorphism (SNP) density variations, underscoring its adaptability to diverse data conditions. Moreover, the model's accuracy improved with the number of auxiliary traits and their correlation strength with target traits, further highlighting its adaptability to complex trait prediction. (3) Lasso_D demonstrated broad transferability, with substantial improvements in predictive accuracy when incorporating multi-environmental data. 
This enhancement resulted in only a 0.3% reduction in accuracy compared to predictions generated using data from the same environment, affirming the model's reliability in cross-environmental scenarios. This study provides groundbreaking insights, pushing the boundaries of predictive accuracy, robustness, and transferability in trait prediction. These findings represent a significant contribution to plant science, plant breeding, and the broader interdisciplinary fields of statistics and artificial intelligence.
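The abstract names the fusion strategies without defining them, so the following is only a rough sketch of how two of them might look, not the authors' implementation. It assumes that "data fusion" means concatenating SNP genotypes and auxiliary traits into one input matrix before fitting a single model, and that "result fusion" means averaging the predictions of separately fitted genomic and phenotypic models. Feature fusion is not sketched. Closed-form ridge regression stands in for Lasso so that the example needs only NumPy. All data are synthetic, and every name (`fuse_data`, `fit_ridge`, `pearson_r`) is hypothetical.

```python
import numpy as np


def fuse_data(snps, traits):
    """Assumed data-level fusion: SNP and auxiliary-trait columns side by side."""
    return np.hstack([snps, traits])


def fit_ridge(X, y, lam=10.0):
    """Closed-form ridge regression on centred data (a stand-in for Lasso).

    Returns a function that predicts the target for new rows.
    """
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    gram = Xc.T @ Xc + lam * np.eye(X.shape[1])
    beta = np.linalg.solve(gram, Xc.T @ (y - y_mean))
    return lambda X_new: (X_new - x_mean) @ beta + y_mean


def pearson_r(a, b):
    """Prediction accuracy measured as the Pearson correlation."""
    return float(np.corrcoef(a, b)[0, 1])


# Synthetic toy data: n lines, p SNPs coded 0/1/2, one auxiliary trait.
rng = np.random.default_rng(0)
n, p = 200, 300
snps = rng.integers(0, 3, size=(n, p)).astype(float)
effects = np.zeros(p)
effects[:20] = rng.normal(size=20)            # 20 causal SNPs (an assumption)
y = snps @ effects + rng.normal(scale=2.0, size=n)
aux = (0.8 * y + rng.normal(scale=2.0, size=n)).reshape(-1, 1)

train, test = slice(0, 150), slice(150, n)
fused = fuse_data(snps, aux)

# Genomic-only (GS), phenotypic-only (PS), and data-fusion models.
gs_model = fit_ridge(snps[train], y[train])
ps_model = fit_ridge(aux[train], y[train])
fusion_model = fit_ridge(fused[train], y[train])

pred_gs = gs_model(snps[test])
pred_ps = ps_model(aux[test])
pred_data_fusion = fusion_model(fused[test])
pred_result_fusion = (pred_gs + pred_ps) / 2  # assumed result-level fusion

r_gs = pearson_r(pred_gs, y[test])
r_ps = pearson_r(pred_ps, y[test])
r_data = pearson_r(pred_data_fusion, y[test])
r_result = pearson_r(pred_result_fusion, y[test])

for name, r in [("GS", r_gs), ("PS", r_ps),
                ("data fusion", r_data), ("result fusion", r_result)]:
    print(f"{name}: r = {r:.3f}")
```

The numbers printed by this toy example say nothing about the paper's results. In practice the SNP and trait columns would likely need standardising before a shared penalty is applied, and the regulariser and its strength would be tuned by cross-validation.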
About the journal:
Plant Communications is an open access publishing platform that supports the global plant science community. It publishes original research, review articles, technical advances, and research resources in various areas of plant sciences. The scope of topics includes evolution, ecology, physiology, biochemistry, development, reproduction, metabolism, molecular and cellular biology, genetics, genomics, environmental interactions, biotechnology, breeding of higher and lower plants, and their interactions with other organisms. The goal of Plant Communications is to provide a high-quality platform for the dissemination of plant science research.