Comparative Analysis of Machine Learning Techniques for Imbalanced Genetic Data

Q1 Decision Sciences

Annals of Data Science Pub Date : 2024-08-13 DOI:10.1007/s40745-024-00575-8

Arshmeet Kaur, Morteza Sarmadi


Citation count: 0

Abstract

Advancements in genome sequencing technologies have significantly increased the availability of genomic data. The use of machine learning models to predict the pathogenicity or clinical significance of genetic mutations is crucial. However, genetic datasets often feature imbalanced target variables and high-cardinality, skewed predictor variables, which complicate machine learning modeling. This study addresses these challenges in both regression and classification tasks. We systematically explored the impact of various data preprocessing techniques, feature selection methods, and model choices on the performance of machine learning models trained on imbalanced genetic data, and evaluated the performance metrics using fivefold cross-validation. Our key findings demonstrate that the regression models are robust to outliers and skew in predictor and target variables. Similarly, in classification tasks, class-imbalanced target variables and skewed predictors minimally impact model performance. Among the models tested, random forest was the most effective for both imbalanced regression and classification tasks. Our key contributions are as follows. First, we address a significant research gap by focusing on imbalanced regression, a problem that is sparsely explored compared to class-imbalanced classification. Second, we identify the techniques that improve prediction performance and provide practical insights into handling genetic data. Finally, we provide a foundation for future research to further optimize machine learning approaches in genomics. This study uses a genetic dataset as a case study, but our findings are applicable to imbalanced data in other fields.
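The fivefold cross-validation mentioned in the abstract can be sketched as follows for a class-imbalanced target. This is a minimal illustration, not the authors' code: the abstract does not say whether the folds were stratified, so stratification is an assumption here, and the helper names `stratified_kfold_indices` and `cross_validate` as well as the 90/10 toy labels are hypothetical.

```python
from collections import defaultdict


def stratified_kfold_indices(labels, k=5):
    """Split sample indices into k folds, preserving class proportions.

    Hypothetical helper for illustration; stratification is an assumption,
    not something the abstract states.
    """
    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Deal each class's indices round-robin across the folds so that
    # every fold receives a near-equal share of each class.
    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        for position, idx in enumerate(by_class[label]):
            folds[position % k].append(idx)
    return folds


def cross_validate(labels, k=5):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    folds = stratified_kfold_indices(labels, k)
    for i, test in enumerate(folds):
        # Training set is the union of all folds except the held-out one.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield sorted(train), sorted(test)


# Toy imbalanced target: 90 samples of class 0 and 10 of class 1 (made-up data).
labels = [0] * 90 + [1] * 10
splits = list(cross_validate(labels, k=5))

# Count minority-class samples in each held-out fold.
minority_per_fold = [sum(labels[i] for i in test) for _, test in splits]
print(minority_per_fold)  # [2, 2, 2, 2, 2]
```

With stratified folds, every held-out fold contains the same share of the minority class, so per-fold metrics are not distorted by a fold that happens to contain few or no minority samples.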


Source journal

Annals of Data Science (Decision Sciences: Statistics, Probability and Uncertainty)

CiteScore: 6.50

Self-citation rate: 0.00%

Journal description: Annals of Data Science (ADS) publishes cutting-edge research findings, experimental results, and case studies in data science. Although data science is regarded as an interdisciplinary field that uses mathematics, statistics, databases, data mining, high-performance computing, knowledge management, and virtualization to discover knowledge from Big Data, it should have its own scientific content, such as axioms, laws, and rules, which are fundamentally important for experts in different fields who explore their own interests in Big Data. ADS encourages contributors to address such challenging problems on this exchange platform. At present, how to discover knowledge from heterogeneous data in a Big Data environment needs to be addressed. ADS is a series of volumes edited by either the editorial office or guest editors. Guest editors are responsible for the call for papers and the review process for high-quality contributions in their volumes.