Comprehensive Machine Learning Model for Cervical Cancer Prediction and Risk Factor Identification

Impact factor: 3.0 | JCR: Q1 (Psychology, Multidisciplinary)

Human Behavior and Emerging Technologies | Pub Date: 2025-07-30 | DOI: 10.1155/hbe2/6629232

Mahendra, Mila Desi Anasanti



Abstract

Cervical cancer is a significant global health challenge for patients and healthcare systems. Early identification and accurate prediction of risk factors are essential for reducing incidence and improving patient outcomes. This study predicts indicators of cervical cancer and supports its diagnosis using a dataset that includes demographic information, lifestyle factors, and medical histories. We developed a predictive model to aid early diagnosis and identify key risk factors. The dataset covers 858 participants and 30 features, with results from four cervical cancer tests: Hinselmann, Schiller, cytology, and biopsy. We imputed missing values (22.14%) using the MICE iterative imputer and balanced the classes with the synthetic minority oversampling technique (SMOTE). We applied five machine learning algorithms: random forest (RF), linear regression (LR), support vector machine (SVM), K-nearest neighbors (KNN), and extreme gradient boosting (XGBoost). We used the SpFSR technique for feature selection, assessing how well a subset of features could maintain accuracy compared to the full model. Selecting fewer features, such as half or even a quarter of the variables, still yielded strong results, which underlines the importance of careful feature selection in cervical cancer prediction. The RF algorithm achieved the highest accuracy: 99% with the full feature set and 98% with a reduced set of five features. Notably, diagnosis and hormonal contraceptives were identified as significant predictors. Hormonal contraceptives, which can affect cervical health, are linked to increased risks of HPV infection and cervical cancer. This study highlights the role of SpFSR in improving prediction models and suggests that external validation is necessary to confirm our findings in diverse populations. Further research should explore additional datasets and variables not covered in this study, as well as the model's practical applicability in clinical settings.
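The class-balancing step mentioned in the abstract can be illustrated with a minimal sketch of the core SMOTE idea: each synthetic minority sample is a random interpolation between a minority point and one of its nearest minority neighbours. This is not the authors' code. The function name `smote`, its parameters, and the toy data are assumptions made for illustration, and only the Python standard library is used.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic minority-class rows by interpolation.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other points
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the k nearest minority neighbours of the base row (itself excluded).
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: math.dist(base, minority[j]),
        )[:k]
        other = minority[rng.choice(neighbours)]
        gap = rng.random()  # interpolation weight in [0, 1)
        # New row lies on the segment between the base row and the chosen neighbour.
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Toy data (made up): 8 majority rows and 3 minority rows with 2 features each.
majority = [[float(x), float(x % 3)] for x in range(8)]
minority = [[10.0, 10.0], [11.0, 10.5], [10.5, 11.5]]

new_rows = smote(minority, n_new=len(majority) - len(minority), k=2)
balanced_minority = minority + new_rows
print(len(majority), len(balanced_minority))
```

For real data, the `SMOTE` class in the imbalanced-learn library is a tested implementation; the sketch above only shows the interpolation step.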



Source journal

Human Behavior and Emerging Technologies (Social Sciences: Social Sciences (all))

CiteScore: 17.20

Self-citation rate: 8.70%

Journal description: Human Behavior and Emerging Technologies is an interdisciplinary journal dedicated to publishing high-impact research that enhances understanding of the complex interactions between diverse human behavior and emerging digital technologies.