Exploring genomic feature selection: A comparative analysis of GWAS and machine learning algorithms in a large‐scale soybean dataset

Hawlader A. Al‐Mamun, Monica F. Danilevicz, Jacob I. Marsh, Cedric Gondro, David Edwards
The Plant Genome, e20503. Published 2024-09-10. DOI: 10.1002/tpg2.20503 (https://doi.org/10.1002/tpg2.20503)

Abstract

The surge in high‐throughput technologies has empowered the acquisition of vast genomic datasets, prompting the search for genetic markers and biomarkers relevant to complex traits. However, the high dimensionality and sparsity inherent in these datasets pose formidable hurdles. The immense number of features and their potential redundancy demand efficient strategies for extracting pertinent information and identifying significant markers. Feature selection is important for large genomic datasets because it enhances interpretability and computational efficiency. This study addresses these challenges through a comprehensive investigation of genomic feature selection methodologies, employing a rich soybean (Glycine max L. Merr.) dataset comprising 966 lines with over 5.5 million single nucleotide polymorphisms. Emphasizing the “small n large p” dilemma prevalent in contemporary genomic studies, we compared the efficacy of traditional genome‐wide association studies (GWAS) with two prominent machine learning tools, random forest and extreme gradient boosting, in pinpointing predictive features. Utilizing the expansive soybean dataset, we assessed the performance of these methodologies in selecting features that optimize predictive modeling for various phenotypes. By constructing predictive models based on the selected features, we ascertained the comparative prediction accuracies, thereby illuminating the strengths and limitations of these feature selection methodologies in the realm of genomic data analysis.
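
To make the idea of association-based feature selection concrete, the following is a minimal sketch, not the authors' pipeline. It assumes a simplified stand-in for a GWAS: each marker is scored by its squared Pearson correlation with the phenotype, and the top k markers are kept. The data are synthetic, and every name here (`squared_correlation`, `select_top_markers`, the marker indices and effect sizes) is hypothetical. A real GWAS would also account for population structure and relatedness, which this sketch ignores.

```python
import random
from statistics import fmean


def squared_correlation(x, y):
    """Squared Pearson correlation between one marker and the phenotype."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:  # monomorphic marker or constant phenotype
        return 0.0
    return sxy * sxy / (sxx * syy)


def select_top_markers(genotypes, phenotype, k):
    """Rank markers by marginal association; return indices of the top k."""
    n_markers = len(genotypes[0])
    scores = []
    for j in range(n_markers):
        column = [row[j] for row in genotypes]
        scores.append((squared_correlation(column, phenotype), j))
    scores.sort(reverse=True)
    return [j for _, j in scores[:k]]


# Synthetic data (assumed, for illustration only): 200 lines, 50 SNPs coded
# 0/1/2, with two causal markers and a small amount of Gaussian noise.
rng = random.Random(0)
n_lines, n_snps = 200, 50
causal = {3: 2.0, 17: -1.5}  # hypothetical marker index -> effect size

genotypes = [[rng.choice((0, 1, 2)) for _ in range(n_snps)]
             for _ in range(n_lines)]
phenotype = [
    sum(effect * row[j] for j, effect in causal.items()) + rng.gauss(0, 0.1)
    for row in genotypes
]

selected = select_top_markers(genotypes, phenotype, k=2)
print(sorted(selected))
```

In the study's setting, the selected indices would then feed a predictive model, and the same downstream model could be trained on features chosen by random forest or extreme gradient boosting importances to compare prediction accuracy across selection methods.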