[Construction and preliminary validation of machine learning predictive models for cervical cancer screening based on human DNA methylation].

Q3 Medicine

中华肿瘤杂志 (Chinese Journal of Oncology) | Pub Date: 2025-02-23 | DOI: 10.3760/cma.j.cn112152-20230925-00156

Y Yang, H Zhou, Y K Wang, Y Dai, R J Pi, H Zhang, Z Y Huang, T Wu, J H Yang, W Chen

Volume 47, Issue 2, pages 193-200. Publication type: Journal Article.

Citations: 0

Abstract

Objective: To construct machine learning predictive models for screening cervical cancer and precancerous lesions using the methylation characteristics of human genes.

Methods: Human DNA methylation testing was performed on 224 cervical exfoliated cell specimens collected between April 2014 and March 2015 at the Cancer Hospital of the Chinese Academy of Medical Sciences, Tianjin Central Hospital of Gynecology Obstetrics, Xinmi Maternal and Child Health Hospital of Henan Province, West China Second Affiliated Hospital of Sichuan University, and Heping Hospital Affiliated to Changzhi Medical College. Hypermethylated gene fragments related to cervical cancer were selected by screening for high-density, high-association, hypermethylated gene fragments and by the LASSO regression algorithm. With cervical intraepithelial neoplasia grade 2 (CIN2) or more severe lesions as the study outcome, three machine learning predictive models were constructed, based on the random forest (RF), naive Bayes (NB), and support vector machine (SVM) algorithms, respectively. A total of 144 outpatient specimens served as the training set, and 80 cervical exfoliated cell specimens from women participating in a cervical cancer screening program served as the test set (called the validation set in the Results) used to verify the models. With histological diagnosis as the gold standard, the performance of the three models in detecting CIN2 or more severe lesions was compared with that of human papillomavirus (HPV) testing and cytological diagnosis.

Results: In the training set of 144 cases, 34 were HPV-positive, a positive rate of 23.61%. Cytologically, 37 cases were diagnosed as negative for intraepithelial lesion or malignancy (NILM) and 107 as atypical squamous cells of undetermined significance (ASC-US) or above. Histologically, there were 28 cases without cervical intraepithelial neoplasia or benign cervical lesions, 31 cases of CIN1, 18 cases of CIN2, 31 cases of CIN3, and 36 cases of squamous cell carcinoma. Seven hypermethylated gene fragments were selected from 45 genes, and the three machine learning models (RF, NB, and SVM) were built on them. In the validation set of 80 cases, 28 were HPV-positive, a positive rate of 35.00%. Cytologically, 65 cases were diagnosed as NILM and 15 as ASC-US or above. Histologically, there were 39 cases without cervical intraepithelial neoplasia or benign cervical lesions, 10 cases of CIN1, 10 cases of CIN2, 11 cases of CIN3, and 10 cases of squamous cell carcinoma. In the validation set, the area under the curve (AUC) values of the RF model, NB model, SVM model, HPV testing, and cytological diagnosis for detecting CIN2 or more severe lesions were 0.90, 0.88, 0.82, 0.68, and 0.45, respectively. The DeLong test showed no statistically significant difference in AUC among the RF, NB, and SVM models (all P > 0.05). The AUC values of the RF and NB models were higher than that of HPV testing (both P < 0.01), and the AUC values of the RF, NB, and SVM models were higher than that of cytological diagnosis (all P < 0.01). The sensitivity of the RF model was similar to that of the NB model (80.65% vs. 77.42%), but the specificity of the NB model was much higher than that of the RF model (93.88% vs. 73.47%).

Conclusion: Among the machine learning predictive models for cervical cancer and precancerous lesions constructed from human DNA methylation, the NB model shows good predictive performance for CIN2 or more severe lesions and may be used for screening cervical cancer and precancerous lesions.
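The reported sensitivities and specificities can be checked against the validation-set histology counts. The following Python sketch is an illustration, not code from the paper: all variable and function names are hypothetical, and it assumes the rates were computed over all 80 validation specimens, with CIN2, CIN3, and squamous cell carcinoma counted as positive. The abstract's wording leaves the model labels on the two sensitivity values slightly ambiguous, so they are left unlabelled here.

```python
# Validation-set histology counts, copied from the abstract.
histology = {
    "no CIN / benign": 39,
    "CIN1": 10,
    "CIN2": 10,
    "CIN3": 11,
    "squamous cell carcinoma": 10,
}

# Outcome definition from the abstract: CIN2 or more severe counts as positive.
positive_grades = {"CIN2", "CIN3", "squamous cell carcinoma"}
positives = sum(n for grade, n in histology.items() if grade in positive_grades)
negatives = sum(histology.values()) - positives


def implied_count(rate_percent, denominator):
    """Nearest whole number of cases that would give this percentage."""
    return round(rate_percent * denominator / 100)


def as_percent(count, denominator):
    """Percentage rounded to two decimals, as the abstract reports it."""
    return round(100 * count / denominator, 2)


# (label, reported percentage, assumed denominator) for each figure checked.
reported = [
    ("sensitivity, first value", 80.65, positives),
    ("sensitivity, second value", 77.42, positives),
    ("specificity, NB", 93.88, negatives),
    ("specificity, RF", 73.47, negatives),
]

# For each figure, record whether a whole-number count reproduces it exactly.
consistent = {}
for label, rate, denom in reported:
    k = implied_count(rate, denom)
    consistent[label] = as_percent(k, denom) == rate
    print(f"{label}: {rate}% ~ {k}/{denom}")
```

Under these assumptions there are 31 positive and 49 negative cases, and the four figures correspond to 25/31, 24/31, 46/49, and 36/49, so the reported percentages are consistent with the histology counts.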


