Handling Class Imbalance in High-Dimensional Biomedical Datasets

2019 IEEE 28th International Conference on Enabling Technologies: Infrastructure for Collaborative Enterprises (WETICE). Pub Date: 2019-06-01. DOI: 10.1109/WETICE.2019.00040

B. Pes


Citations: 4

Abstract



When dealing with biomedical data, the first and most challenging issue is often the very high dimensionality, i.e., the presence of a very large number of features for each problem instance. A vast literature covers dimensionality reduction techniques suitable for this kind of data, with a special focus on feature selection algorithms that discard uninformative features. In most cases, however, the dimensionality issue is addressed without jointly considering other potential problems in the data, including an imbalanced class distribution that may hinder the construction of effective classification models. Class imbalance, in turn, has mostly been treated in the literature as an independent problem, especially in application fields where the number of features is not so critical. But several biomedical datasets are both high-dimensional and class-imbalanced, so there is a strong need to design and evaluate learning strategies that can properly deal with both issues simultaneously. In this work, we experiment with feature selection techniques in conjunction with sampling-based class balancing methods and cost-sensitive classification, in order to gain insight into the most effective strategies for dealing with such complex data.
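The abstract names three ingredients (feature selection, sampling-based class balancing, cost-sensitive classification) without specifying algorithms. The following is a minimal sketch of how such pieces might fit together, not the author's method. Everything in it is an assumption made for illustration: the synthetic data, random undersampling as the balancing step, a simple mean-difference filter as the feature selector, the balance-then-select ordering, the 5:1 cost ratio, and all function names.

```python
import random
from statistics import mean, pstdev


def make_data(n_major=300, n_minor=30, n_features=200, n_informative=5, seed=0):
    """Synthetic stand-in for a high-dimensional, imbalanced dataset.
    Only the first n_informative features carry class signal (assumption)."""
    rng = random.Random(seed)
    X, y = [], []
    for label, n in ((0, n_major), (1, n_minor)):
        for _ in range(n):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
            if label == 1:
                for j in range(n_informative):
                    row[j] += 3.0  # shift informative features for the minority class
            X.append(row)
            y.append(label)
    return X, y


def undersample(X, y, seed=0):
    """Random undersampling: keep every minority row (label 1) and an
    equally sized random subset of majority rows (label 0)."""
    rng = random.Random(seed)
    minor = [i for i, c in enumerate(y) if c == 1]
    major = [i for i, c in enumerate(y) if c == 0]
    keep = minor + rng.sample(major, len(minor))
    return [X[i] for i in keep], [y[i] for i in keep]


def rank_features(X, y, k):
    """Univariate filter: score each feature by the absolute difference of
    class means divided by the summed class spreads; return the top-k indices."""
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, c in zip(X, y) if c == 1]
        neg = [row[j] for row, c in zip(X, y) if c == 0]
        spread = pstdev(pos) + pstdev(neg) + 1e-12  # avoid division by zero
        scores.append(abs(mean(pos) - mean(neg)) / spread)
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return order[:k]


def cost_sensitive_label(p_minority, cost_fn=5.0, cost_fp=1.0):
    """Minimum-expected-cost decision: predict the minority class when the
    expected cost of missing it exceeds the expected cost of a false alarm."""
    return 1 if p_minority * cost_fn > (1.0 - p_minority) * cost_fp else 0


X, y = make_data()
X_bal, y_bal = undersample(X, y)             # balance first (ordering is an assumption)
selected = rank_features(X_bal, y_bal, k=5)  # then select features on the balanced sample
print("selected features:", sorted(selected))
print("label at p=0.3 with 5:1 costs:", cost_sensitive_label(0.3))
```

With a 5:1 cost ratio the decision threshold on the minority-class probability drops from 0.5 to 1/6, which is one way cost sensitivity can stand in for, or complement, resampling.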
