Fast3VmrMLM: A fast algorithm that integrates genome-wide scanning with machine learning to accelerate gene mining and breeding by design for polygenic traits in large-scale GWAS datasets.
Authors: Jingtian Wang, Ying Chen, Guoping Shu, Miaomiao Zhao, Ao Zheng, Xiaoyu Chang, Guiqi Li, Yibo Wang, Yuan-Ming Zhang
Journal: Plant Communications, article 101385 (Impact factor: 9.4; JCR Q1, Biochemistry & Molecular Biology)
Publication date: 2025-05-22
DOI: 10.1016/j.xplc.2025.101385 (https://doi.org/10.1016/j.xplc.2025.101385)
Citations: 0
Abstract
Genetic dissection and breeding by design for polygenic traits remain challenging. Meeting these challenges requires identifying as many genes as possible, including key genes. To this end, a framework that combines genome-wide scanning with machine learning was developed and integrated with advanced computational techniques, yielding a novel algorithm, Fast3VmrMLM, that mines more genes, including key genes, for polygenic traits in the era of big data and artificial intelligence. The algorithm was also extended to identify haplotype (Fast3VmrMLM-Hap) and molecular (Fast3VmrMLM-mQTL) variants. In simulation studies, Fast3VmrMLM outperformed existing methods in detecting dominant, small and rare variants, and it took 3.30 and 5.43 hours (20 threads) to analyze the 18K rice and UK Biobank-scale datasets, respectively. Fast3VmrMLM identified 211 known and 384 candidate genes for 14 traits in the 18K rice dataset, more than FarmCPU (100 known genes). In a maize NC II design, Fast3VmrMLM identified 26 known and 24 candidate genes for 7 yield-related traits, and Fast3VmrMLM-mQTL identified two known soybean genes around structural variants. We demonstrated that the new two-step framework outperformed genome-wide scanning alone. In breeding by design, a genetic network constructed by machine learning from all known and candidate genes in this study identified 21 key genes for rice yield-related traits, while all the associated markers gave high prediction accuracies in rice (0.7443) and maize (0.8492) and excellent hybrid combinations. A new breeding-by-design strategy based on the identified key genes was also proposed. This study provides an excellent method for gene mining and breeding by design.
About the journal:
Plant Communications is an open access publishing platform that supports the global plant science community. It publishes original research, review articles, technical advances, and research resources in various areas of plant sciences. The scope of topics includes evolution, ecology, physiology, biochemistry, development, reproduction, metabolism, molecular and cellular biology, genetics, genomics, environmental interactions, biotechnology, breeding of higher and lower plants, and their interactions with other organisms. The goal of Plant Communications is to provide a high-quality platform for the dissemination of plant science research.