Hawlader A. Al‐Mamun, Monica F. Danilevicz, Jacob I. Marsh, Cedric Gondro, David Edwards
Exploring genomic feature selection: A comparative analysis of GWAS and machine learning algorithms in a large‐scale soybean dataset
The Plant Genome, article e20503. Published 2024-09-10. DOI: https://doi.org/10.1002/tpg2.20503
Abstract
High‐throughput technologies now make it possible to collect very large genomic datasets, prompting the search for genetic markers and biomarkers relevant to complex traits. The high dimensionality and sparsity of these datasets, however, pose substantial challenges: the sheer number of features, many of them potentially redundant, demands efficient strategies for extracting pertinent information and identifying significant markers. Feature selection is important for large genomic datasets because it improves interpretability and computational efficiency. This study investigates genomic feature selection methods using a soybean (Glycine max L. Merr.) dataset of 966 lines with over 5.5 million single nucleotide polymorphisms (SNPs). Focusing on the “small n, large p” problem common in contemporary genomic studies, we compared traditional genome‐wide association studies (GWAS) with two widely used machine learning methods, random forest and extreme gradient boosting, in their ability to identify predictive features. We assessed how well each method selects features that optimize predictive models for several phenotypes. By building predictive models from the selected features, we compared prediction accuracies and thereby characterized the strengths and limitations of each feature selection approach for genomic data analysis.
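The select-then-predict workflow described in the abstract can be illustrated with a toy example. The sketch below is not the authors' pipeline: it replaces GWAS, random forest, and extreme gradient boosting with a plain single-marker correlation scan, and it runs on simulated data. All sizes, marker indices, effect sizes, and names (`CAUSAL`, `EFFECT`, `TOP_K`, and so on) are hypothetical values chosen for illustration. It shows the general shape of the comparison: rank markers on training lines only, keep the top few, build a predictor from them, and measure accuracy on held-out lines.

```python
import random
import statistics

random.seed(0)

# Toy sizes (assumptions); the study used 966 lines and over 5.5 million SNPs.
N_LINES, N_SNPS, N_TRAIN, TOP_K = 300, 500, 225, 3
CAUSAL = [10, 250, 400]        # hypothetical causal marker indices
EFFECT, NOISE_SD = 2.0, 1.0    # hypothetical effect size and noise level

# Genotypes coded as allele counts (0, 1, 2) at allele frequency 0.5.
geno = [[random.randint(0, 1) + random.randint(0, 1) for _ in range(N_SNPS)]
        for _ in range(N_LINES)]
pheno = [EFFECT * sum(row[j] for j in CAUSAL) + random.gauss(0.0, NOISE_SD)
         for row in geno]

# Hold out lines so that feature selection never sees the evaluation data.
train_x, test_x = geno[:N_TRAIN], geno[N_TRAIN:]
train_y, test_y = pheno[:N_TRAIN], pheno[N_TRAIN:]


def column(matrix, j):
    """Return marker j across all lines of the matrix."""
    return [row[j] for row in matrix]


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5 if sxx > 0 and syy > 0 else 0.0


def slope(x, y):
    """Least-squares slope of y regressed on x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx


# Single-marker scan: score each SNP by squared correlation with the trait.
scores = [pearson(column(train_x, j), train_y) ** 2 for j in range(N_SNPS)]
selected = sorted(range(N_SNPS), key=lambda j: scores[j], reverse=True)[:TOP_K]

# Additive predictor: per-marker slopes fitted on the training lines only.
betas = {j: slope(column(train_x, j), train_y) for j in selected}
pred = [sum(betas[j] * row[j] for j in selected) for row in test_x]

# Prediction accuracy as the correlation between predicted and observed values.
accuracy = pearson(pred, test_y)

print("selected markers:", sorted(selected))
print("held-out prediction accuracy (r):", round(accuracy, 3))
```

Swapping the scoring step for a different ranking, such as importance scores from a tree ensemble, while keeping the same held-out evaluation is one way to compare selection methods on equal terms. Summing univariate slopes is only reasonable here because the simulated markers are independent; real SNP data with linkage disequilibrium would call for a joint model.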