A Study on the Impact of Data Characteristics in Imbalanced Regression Tasks

Paula Branco, L. Torgo
2019 IEEE International Conference on Data Science and Advanced Analytics (DSAA), published 2019-10-01
DOI: 10.1109/DSAA.2019.00034 (https://doi.org/10.1109/DSAA.2019.00034)
Citation count: 2

Abstract

The class imbalance problem has been thoroughly studied over the past two decades. More recently, the research community realized that imbalanced distributions also occur in tasks beyond classification. Regression is among these newly studied tasks, and imbalanced domains pose important challenges there as well. Imbalanced regression problems occur in a diversity of real-world domains, such as meteorology (predicting extreme weather values), finance (forecasting extreme stock returns), and medicine (anticipating rare values). In imbalanced regression, the end-user's preferences are biased towards values of the target variable that are under-represented in the available data. Several pre-processing methods have been proposed to address this problem. These methods change the training set to force the learner to focus on the rare cases. However, as far as we know, the relationship between intrinsic data characteristics and the performance achieved by these methods has not yet been studied for imbalanced regression tasks. In this paper we study the impact that certain data characteristics may have on the results of applying pre-processing methods to imbalanced regression problems. To achieve this goal, we define potentially interesting data characteristics of regression problems. We then conduct our study using a synthetic data repository built for this purpose. We show that each of the characteristics studied behaves differently, and that its behaviour depends on the level at which the characteristic is present and on the learning algorithm used.
The main contributions of our work are: i) to define interesting data characteristics for regression tasks; ii) to create the first repository of imbalanced regression tasks, containing 6000 data sets with controlled data characteristics; and iii) to provide insights into the impact of intrinsic data characteristics on the results of pre-processing methods for handling imbalanced regression tasks.
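
The abstract does not say which pre-processing methods the study evaluates. As a rough illustration only of what "changing the training set to force the learner to focus on the rare cases" can look like, the sketch below applies random under-sampling to the common cases of a toy regression data set. The function name `undersample_common`, the fixed `threshold` that separates rare from common target values, and the `keep_fraction` parameter are all assumptions made for this example and are not taken from the paper.

```python
import random


def undersample_common(rows, targets, threshold, keep_fraction, seed=0):
    """Keep every rare case (target >= threshold) and a random
    fraction of the common cases (target < threshold).

    Hypothetical helper: the threshold-based notion of "rare" is an
    assumption for this sketch, not the paper's definition.
    """
    if not 0.0 <= keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in [0, 1]")
    rng = random.Random(seed)
    rare = [i for i, y in enumerate(targets) if y >= threshold]
    common = [i for i, y in enumerate(targets) if y < threshold]
    n_keep = round(keep_fraction * len(common))
    # Sample common cases without replacement, then restore original order.
    kept = sorted(rare + rng.sample(common, n_keep))
    return [rows[i] for i in kept], [targets[i] for i in kept]


# Toy data: 95 common targets in [0, 1) and 5 rare, extreme targets in [10, 11).
rng = random.Random(42)
targets = [rng.random() for _ in range(95)] + [10.0 + rng.random() for _ in range(5)]
rows = [[float(i)] for i in range(len(targets))]

new_rows, new_targets = undersample_common(
    rows, targets, threshold=5.0, keep_fraction=0.2
)
n_rare = sum(y >= 5.0 for y in new_targets)
print(len(new_targets), n_rare)
```

In this toy setting the rare cases make up 5 of 100 examples before under-sampling and 5 of 24 afterwards, so a learner trained on the reduced set sees proportionally more of the extreme values. How well such a change works is exactly what may depend on the data characteristics and the learning algorithm.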