Comparative Analysis of Predictive Analytics Models in Classification Problems

2019 Actual Problems of Systems and Software Engineering (APSSE). Pub Date: 2019-11-01. DOI: 10.1109/APSSE47353.2019.00028

K. Polyakov, Liudmila Zhukova


Citation count: 1

Abstract

This research compares the classification quality of several descriptive and predictive analytics methods in the case where most (or all) of the independent variables are measured on qualitative scales with a large number of levels. In this case, some classification methods, or their popular implementations, require converting the qualitative variables into systems of dummy variables. If the qualitative scales have many levels that appear in nearly equal proportions in the training set, so that merging levels makes no sense, this requirement leads to a dramatic rise in the problem dimension. As a result, the researcher faces the curse of dimensionality: as the problem dimension rises, the sample size must rise as well to preserve the accuracy of the factor impact estimates. At the same time, it is not always possible to grow the training set accordingly; in some cases its size is limited by specific properties of the system under study. In such a situation, it is extremely important to evaluate how sensitive prediction/classification methods are to the curse of dimensionality.

The authors focus on four classification methods that have long held top positions in lists of popular business analysis methods:
• Two methods of classification tree building: CART and C4.5
• Logistic regression
• Classification based on random forest

The first three are descriptive methods, which yield interpretable (human-readable) models; the fourth belongs to predictive analytics. The selection is not random. Descriptive analytics problems are extremely important for planning, when one needs an answer to the question "What will happen if …?". In particular, one needs a description of the target group in order to organize marketing communication. At the same time, it is quite conceivable that using interpretable models entails a loss of prediction quality compared with predictive analytics methods.

The research domain is the activity of microfinance institutions (MFIs). The traditional problem here is the assessment of a potential client. The main challenge in solving this problem is the constraints on the volume, composition, and type of data available for predicting default or assessing default probability. It is therefore necessary to evaluate the abilities of classification methods that were designed for large amounts of data (that is, a large training set and many variables, from which the most important should be selected). In the real practice of a microfinance organization, most recorded factors are measured on qualitative scales with a large number of levels, which is the origin of the problems described above. The empirical part of the research is based on data from a real microfinance organization. Some hypotheses about the reasons for default were tested as a byproduct of this research.
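
The dimension growth that the abstract attributes to dummy-variable conversion can be illustrated with a minimal sketch. This is not the authors' code: the sample rows, the variable names, and both helper functions (dummy_encode, dummy_dimension) are hypothetical. The sketch assumes the common convention of dropping one reference level per variable, so a qualitative variable with k levels yields k - 1 dummy columns.

```python
from typing import Dict, List, Sequence, Tuple


def dummy_encode(
    rows: Sequence[Dict[str, str]], drop_first: bool = True
) -> Tuple[List[Tuple[str, str]], List[List[int]]]:
    """Expand qualitative variables into 0/1 dummy columns.

    Returns the list of (variable, level) column labels and the encoded rows.
    Assumes every row contains every variable (hypothetical, clean data).
    """
    variables = sorted({name for row in rows for name in row})
    columns: List[Tuple[str, str]] = []
    for var in variables:
        levels = sorted({row[var] for row in rows})
        if drop_first:
            # The first level (alphabetically) becomes the reference level.
            levels = levels[1:]
        columns.extend((var, level) for level in levels)
    encoded = [
        [1 if row[var] == level else 0 for var, level in columns]
        for row in rows
    ]
    return columns, encoded


def dummy_dimension(level_counts: Sequence[int], drop_first: bool = True) -> int:
    """Number of dummy columns produced by variables with the given level counts."""
    offset = 1 if drop_first else 0
    return sum(k - offset for k in level_counts)


# Hypothetical client records with two qualitative variables.
rows = [
    {"region": "north", "occupation": "driver"},
    {"region": "south", "occupation": "teacher"},
    {"region": "east", "occupation": "driver"},
    {"region": "west", "occupation": "cook"},
]

columns, encoded = dummy_encode(rows)
print(len(columns))                  # 2 original variables become 5 dummy columns
print(dummy_dimension([50] * 10))    # 10 variables with 50 levels each
```

Under these assumptions, ten qualitative variables with 50 levels each expand into 490 dummy columns, while the training set size stays the same. Methods that split directly on qualitative levels would not need this expansion, which is one reason their sensitivity to dimensionality may differ.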


