A proposed hybrid framework to improve the accuracy of customer churn prediction in telecom industry

IF 6.4, Zone 2 (Computer Science), JCR Q1 (COMPUTER SCIENCE, THEORY & METHODS)

Journal of Big Data, Pub Date: 2024-05-09, DOI: 10.1186/s40537-024-00922-9

Shimaa Ouf, Kholoud T. Mahmoud, Manal A. Abdel-Fattah

Citations: 0

Abstract

Predicting customer churn has become increasingly important in the telecom sector in recent years. Developing a robust and accurate churn prediction model takes time, but it is crucial: early churn prediction avoids revenue loss and improves customer retention, so telecom companies must identify at-risk customers before they leave. Researchers have applied a variety of machine-learning approaches to reveal the hidden relationships between features. A key aspect of churn prediction is the level of accuracy the learning model achieves. This study aims to clarify several aspects of churn prediction accuracy and to investigate the performance of state-of-the-art techniques. No previous research, however, has investigated performance using a hybrid framework that combines suitable data preprocessing, ensemble learning, and resampling techniques. The study proposes such a hybrid framework to improve the accuracy of customer churn prediction in the telecom industry. The framework integrates the XGBoost classifier with the hybrid resampling method SMOTE-ENN and applies effective data preprocessing techniques. It is evaluated in two experiments on three telecom datasets. The study determines which features most strongly influence customer churn, shows the impact of data balancing, compares classifier performance before and after balancing, and examines the speed-accuracy trade-off in hybrid classifiers. The results are analyzed using several metrics, including accuracy, precision, recall, F1-score, and the ROC curve, and all of these criteria are used to identify the most effective experiment. In terms of accuracy, the hybrid framework applied to balanced data outperformed the classifier applied alone to imbalanced data. In addition, the results of the proposed framework are compared with previous studies on the same datasets; on all three datasets, the proposed hybrid framework outperformed the latest works reviewed.
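The resampling step named in the abstract, SMOTE followed by Edited Nearest Neighbours (ENN), can be illustrated with a small standard-library sketch. This is not the authors' implementation: the function names, the toy two-feature data, the neighbour counts (`k=5` for SMOTE, `k=3` for ENN), and the majority-vote form of the ENN rule are all assumptions chosen for illustration. In a full pipeline, the cleaned set would then be used to train a classifier such as XGBoost.

```python
import math
import random
from collections import Counter

# Each sample is a (features, label) pair; label 1 = churn (hypothetical toy data).

def nearest(point, pool, k):
    """Return the k samples in `pool` closest to `point`, excluding `point` itself."""
    others = [p for p in pool if p is not point]
    return sorted(others, key=lambda p: math.dist(point[0], p[0]))[:k]


def smote(samples, minority, k=5, rng=random):
    """Add synthetic minority samples until both classes are the same size (binary case)."""
    minority_pts = [s for s in samples if s[1] == minority]
    n_needed = len(samples) - 2 * len(minority_pts)  # majority count - minority count
    synthetic = []
    for _ in range(n_needed):
        base = rng.choice(minority_pts)
        neighbour = rng.choice(nearest(base, minority_pts, k))
        gap = rng.random()  # position along the segment from base to neighbour
        features = tuple(b + gap * (n - b) for b, n in zip(base[0], neighbour[0]))
        synthetic.append((features, minority))
    return samples + synthetic


def enn(samples, k=3):
    """Keep only samples whose label matches the majority of their k nearest neighbours."""
    kept = []
    for s in samples:
        votes = Counter(n[1] for n in nearest(s, samples, k))
        if votes.most_common(1)[0][0] == s[1]:
            kept.append(s)
    return kept


# Toy imbalanced data: 40 non-churners around (0, 0), 8 churners around (2, 2).
rng = random.Random(0)
stay = [((rng.gauss(0, 1), rng.gauss(0, 1)), 0) for _ in range(40)]
churn = [((rng.gauss(2, 1), rng.gauss(2, 1)), 1) for _ in range(8)]
data = stay + churn

balanced = smote(data, minority=1, rng=rng)  # oversample the churn class
cleaned = enn(balanced)                      # then remove noisy or borderline samples

for name, subset in [("original", data), ("after SMOTE", balanced), ("after ENN", cleaned)]:
    print(name, dict(Counter(label for _, label in subset)))
```

SMOTE alone can place synthetic churners inside the majority region; the ENN pass afterwards removes samples of either class that sit among neighbours of the other class, which is the motivation for combining the two. Resampling of this kind is normally applied to the training split only, so that evaluation metrics are computed on the original class distribution.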

Source journal

Journal of Big Data (Computer Science - Information Systems)

CiteScore: 17.80
Self-citation rate: 3.70%
Articles published: 105
Review time: 13 weeks

About the journal: The Journal of Big Data publishes high-quality, scholarly research papers, methodologies, and case studies covering a broad spectrum of topics, from big data analytics to data-intensive computing and all applications of big data research. It addresses challenges facing big data today and in the future, including data capture and storage, search, sharing, analytics, technologies, visualization, architectures, data mining, machine learning, cloud computing, distributed systems, and scalable storage. The journal serves as a seminal source of innovative material for academic researchers and practitioners alike.