Comprehensive Machine Learning Model for Cervical Cancer Prediction and Risk Factor Identification
Authors: Mahendra, Mila Desi Anasanti
Journal: Human Behavior and Emerging Technologies, Volume 2025, Issue 1
Published: 2025-07-30
DOI: 10.1155/hbe2/6629232
URL: https://onlinelibrary.wiley.com/doi/10.1155/hbe2/6629232
Citation count: 0
Abstract
Cervical cancer is a significant global health challenge for patients and healthcare systems. Early identification and accurate prediction of risk factors are essential for reducing incidence and improving patient outcomes. This study predicts cervical cancer indicators and diagnoses from a dataset that includes demographic information, lifestyle factors, and medical histories, with the aim of aiding early diagnosis and identifying key risk factors. The dataset covers 858 participants and 30 features and includes four cervical cancer tests (Hinselmann, Schiller, cytology, and biopsy). We handled missing values (22.14%) using the MICE iterative imputer and balanced the data with the synthetic minority oversampling technique (SMOTE). We applied five machine learning algorithms: random forest (RF), linear regression (LR), support vector machine (SVM), K-nearest neighbors (KNN), and extreme gradient boosting (XGBoost). We used the SpFSR technique for feature selection, assessing how well a subset of features could maintain accuracy relative to the full model. Models built on fewer features, such as half or even a quarter of the variables, still yielded strong results, which underscores the value of careful feature selection in cervical cancer prediction. The RF algorithm achieved the highest accuracy: 99% with the full feature set and 98% with a reduced set of five features. Notably, diagnosis and hormonal contraceptives were identified as significant predictors. Hormonal contraceptives, which can affect cervical health, are linked to increased risks of HPV infection and cervical cancer. This study highlights the role of SpFSR in improving prediction models; external validation is needed to confirm the findings in diverse populations. Further research should explore additional datasets and variables not covered in this study, as well as the model’s practical applicability in clinical settings.
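The abstract names SMOTE as the balancing step but does not describe how it was implemented. As a rough illustration of the underlying idea only, the sketch below creates synthetic minority-class points by interpolating between a minority sample and one of its nearest minority neighbors. The function name `smote`, its parameters, and the toy data are assumptions made for this example and are not taken from the study.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by SMOTE-style interpolation.

    minority: list of equal-length numeric feature vectors (minority class only).
    Hypothetical helper for illustration; not the study's implementation.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbors than other points
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the k nearest minority neighbors of the base point,
        # excluding the base point itself (Euclidean distance).
        neighbors = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: math.dist(base, minority[j]),
        )[:k]
        neighbor = minority[rng.choice(neighbors)]
        gap = rng.random()  # position along the segment from base to neighbor
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


# Toy data, purely illustrative (not drawn from the study's dataset).
majority = [[float(x), float(x % 3)] for x in range(10)]
minority = [[1.0, 1.0], [1.5, 1.2], [2.0, 0.8]]

# Generate enough synthetic points to match the majority class size.
new_points = smote(minority, n_new=len(majority) - len(minority), k=2)
balanced_minority = minority + new_points
print(len(majority), len(balanced_minority))
```

Oversampling is commonly applied to the training split only, so that synthetic points do not leak into the evaluation data; the abstract does not say how the study handled this.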
About the journal:
Human Behavior and Emerging Technologies is an interdisciplinary journal dedicated to publishing high-impact research that enhances understanding of the complex interactions between diverse human behavior and emerging digital technologies.