The number of clusters in hybrid predictive models: does it really matter?

Mariusz Łapczyński, Bartłomiej Jefmański
Przegląd Statystyczny, published 2020-03-17. DOI: 10.5604/01.3001.0013.9131 (https://doi.org/10.5604/01.3001.0013.9131)

Abstract

Researchers have long combined analytical tools to build predictive models. Tools of the same type can be combined (ensemble models, committees), as can tools of different types (hybrid models). Hybrid models are used in areas such as customer relationship management (CRM), web usage mining, medical sciences, petroleum geology and anomaly detection in computer networks. Our hybrid model is a sequential combination of cluster analysis and decision trees. In the first step, objects were grouped into clusters using the k-means algorithm. In the second step, a decision tree was built with an additional independent variable indicating the cluster to which each object belonged. The analysis was based on 14 data sets collected from publicly accessible repositories. Model performance was assessed with measures derived from the confusion matrix, including accuracy, precision, recall, F-measure, and the lift in the first and second deciles. We looked for a relationship between the number of clusters and the quality of hybrid predictive models. To our knowledge, no similar study has been conducted. Our research shows that in some cases building hybrid models can improve predictive performance. The models with the highest performance measures required a relatively large number of clusters (from 9 to 15).
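The two-step procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the use of scikit-learn, the synthetic data from `make_classification`, the standardisation before clustering, the tree depth, the integer coding of the cluster label and the helper names `hybrid_tree`, `decile_lift` and `evaluate` are all assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier


def decile_lift(y_true, scores, n_deciles=1):
    """Positive rate among the top-scored n_deciles * 10% of cases,
    divided by the overall positive rate."""
    y_true = np.asarray(y_true)
    order = np.argsort(-np.asarray(scores), kind="stable")
    n_top = max(1, int(round(len(y_true) * 0.1 * n_deciles)))
    return y_true[order[:n_top]].mean() / y_true.mean()


def hybrid_tree(X_train, y_train, X_test, k, seed=0):
    """Step 1: k-means on the predictors. Step 2: decision tree trained on
    the predictors plus the cluster label as an extra column."""
    scaler = StandardScaler().fit(X_train)  # assumption: standardise first
    km = KMeans(n_clusters=k, n_init=10, random_state=seed)
    km.fit(scaler.transform(X_train))

    def with_cluster(X):
        # Assumption: the label is appended as an integer code, which the
        # tree treats as ordered; one-hot encoding is an alternative.
        return np.column_stack([X, km.predict(scaler.transform(X))])

    tree = DecisionTreeClassifier(max_depth=5, random_state=seed)
    tree.fit(with_cluster(X_train), y_train)
    return tree, with_cluster(X_test)


def evaluate(k_values=(2, 9, 15), seed=0):
    """Fit one hybrid model per k on synthetic data and collect measures."""
    X, y = make_classification(n_samples=1000, n_features=8,
                               n_informative=5, random_state=seed)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed)
    results = {}
    for k in k_values:
        tree, X_te_h = hybrid_tree(X_tr, y_tr, X_te, k, seed)
        pred = tree.predict(X_te_h)
        prob = tree.predict_proba(X_te_h)[:, 1]
        results[k] = {
            "accuracy": accuracy_score(y_te, pred),
            "precision": precision_score(y_te, pred, zero_division=0),
            "recall": recall_score(y_te, pred, zero_division=0),
            "f_measure": f1_score(y_te, pred, zero_division=0),
            "lift_decile_1": decile_lift(y_te, prob, 1),
            "lift_decile_2": decile_lift(y_te, prob, 2),
        }
    return results


if __name__ == "__main__":
    for k, measures in evaluate().items():
        print(k, {name: round(value, 3) for name, value in measures.items()})
```

Looping over several values of k and comparing the resulting measures mirrors the question the study asks. Any numbers this sketch produces come from synthetic data and say nothing about the 14 data sets used in the paper.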