Comparative Evaluation of Oversampling Techniques for Balancing Metabolic Profiles

Mikel Hernandez, Gorka Epelde, R. Gil-Redondo, N. Embade, Ane Alberdi, I. Macía, Ó. Millet
Proceedings of the 9th International Conference on Bioinformatics Research and Applications. Published 2022-09-18. DOI: 10.1145/3569192.3569200 (https://doi.org/10.1145/3569192.3569200)

Abstract

Imbalanced data is a common problem when data analytics paradigms such as statistical analyses and predictive models are applied to binary and multiclass data, particularly when they rely on classification metrics that are sensitive to class imbalance, e.g., accuracy. Although pre-processing, algorithm-level, and hybrid approaches to this problem exist, none of them focuses specifically on balancing metabolic profiles for Metabolic Syndrome analysis. Because an imbalance between the Metabolic Syndrome related subclasses can reduce statistical power, it can weaken the insights and conclusions drawn from analyses of metabolic data. Thus, there is a need to balance metabolic data to improve the insights derived from these types of analyses. In this context, this paper presents a comparative evaluation of six oversampling techniques for balancing metabolic profiles (SMOTE, B-SMOTE, ADASYN, ROS, K-SMOTE, and SVM-SMOTE). The analysis uses an imbalanced dataset with 16 classes, formed by the combinations of 4 binary metabolic conditions. Additionally, a methodology is defined to objectively evaluate and compare the six oversampling techniques in terms of representativity and variety. The results show that ROS and SMOTE are the best oversampling techniques for balancing metabolic data, generating high-quality synthetic profiles that resemble the real ones while balancing all classes equally. This demonstrates that metabolomics studies focused on Metabolic Syndrome can rely on these oversampling methods to improve their conclusions.
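To make the two best-performing techniques concrete, the following is a minimal sketch, not the authors' implementation. It assumes that the 16 classes are obtained by reading the 4 binary conditions as a 4-bit label, that ROS duplicates existing minority samples, and that SMOTE interpolates between a sample and one of its k nearest same-class neighbours. All function names, the toy data, and the value of k are hypothetical; the abstract does not state them.

```python
import random
from collections import Counter


def encode_class(conditions):
    """Map 4 binary conditions to one of 16 class labels (0..15). Assumed encoding."""
    label = 0
    for bit in conditions:
        label = (label << 1) | int(bit)
    return label


def random_oversample(X, y, rng):
    """ROS: duplicate randomly chosen samples until each class matches the largest."""
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(list(rng.choice(pool)))  # exact copy of a real profile
            y_out.append(label)
    return X_out, y_out


def smote_oversample(X, y, rng, k=5):
    """SMOTE-style: interpolate between a sample and a near same-class neighbour."""
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            base = rng.choice(pool)
            others = [p for p in pool if p is not base]
            if not others:
                # A class with a single sample has no neighbour: fall back to a copy.
                X_out.append(list(base))
            else:
                # Sort the remaining samples by squared Euclidean distance to base.
                others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
                neighbour = rng.choice(others[:k])
                gap = rng.random()  # position on the segment base -> neighbour
                X_out.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
            y_out.append(label)
    return X_out, y_out


# Toy data (made up): three of the 16 classes, two features per profile.
rng = random.Random(0)
conditions = [(0, 0, 0, 0)] * 6 + [(0, 1, 0, 1)] * 3 + [(1, 1, 1, 1)] * 2
y = [encode_class(c) for c in conditions]
X = [[rng.gauss(label, 1.0), rng.gauss(-label, 1.0)] for label in y]

X_ros, y_ros = random_oversample(X, y, rng)
X_sm, y_sm = smote_oversample(X, y, rng, k=2)
print(Counter(y_ros))
print(Counter(y_sm))
```

Both functions leave the original profiles untouched and only append new ones, so every class ends up with as many samples as the largest class. The difference is that ROS adds exact copies, whereas the SMOTE-style variant adds new points lying between real profiles, which is the kind of trade-off an evaluation of representativity and variety would capture.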