Gowri Sivaramakrishnan, Kannan Sridharan, Mohammed Abdulla AlMuharraqi
DNA Methylation-Based Machine Learning Models for Classification of Oral Cancer and Potentially Malignant Lesions: A Proof-Of-Concept Study
Journal of Stomatology Oral and Maxillofacial Surgery, 102594. Published 2025-10-04.
DOI: 10.1016/j.jormas.2025.102594 (https://doi.org/10.1016/j.jormas.2025.102594)
Abstract
Background: Accurate classification of oral squamous cell carcinoma (OSCC) and oral potentially malignant lesions (OPLs) is challenging due to histopathological variability and limited predictive biomarkers. DNA methylation offers a promising molecular signature, but its utility for tissue classification remains underexplored.
Methods: We harmonized publicly available DNA methylation datasets (GSE97784 and GSE204943; n = 142) and selected the 100 most variable CpG sites (variance 0.074-0.117) for analysis. Eight supervised machine learning (ML) models were trained using 10-fold cross-validation: logistic regression, random forest (RF), support vector machine (SVM), extreme gradient boosting (XGBoost), k-nearest neighbors (kNN), Naive Bayes, gradient boosting machine (GBM), and neural network (NN). Principal component analysis (PCA) was performed to assess data dimensionality.
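The variance-based feature selection described in the Methods can be illustrated with a minimal sketch. It is not the authors' pipeline: the function name `top_variable_sites`, the CpG IDs, and the beta values are hypothetical, and the use of sample variance (rather than population variance) is an assumption, since the abstract does not say which was used.

```python
from statistics import variance


def top_variable_sites(beta, k):
    """Return the k CpG IDs with the largest sample variance across samples.

    beta: dict mapping a CpG ID to a list of beta values (one per sample).
    Assumption: sample variance (n - 1 denominator); the source does not specify.
    """
    ranked = sorted(beta, key=lambda cpg: variance(beta[cpg]), reverse=True)
    return ranked[:k]


# Toy, made-up beta values for illustration only (not data from the study).
toy_beta = {
    "cg_a": [0.10, 0.90, 0.15, 0.85],  # highly variable across samples
    "cg_b": [0.50, 0.52, 0.49, 0.51],  # nearly constant
    "cg_c": [0.20, 0.60, 0.30, 0.70],  # moderately variable
}

selected = top_variable_sites(toy_beta, k=2)
print(selected)
```

In the study the same idea would apply with k = 100 over the harmonized methylation matrix; the toy example keeps only the two sites whose beta values vary most.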
Results: High-variance CpG sites were predominantly located within gene bodies and clustered on chromosomes 1, 2, and 6. PCA revealed complex, high-dimensional methylation patterns, with 55 components required to capture 90% of the variance. Overall, RF achieved the highest accuracy (78%) and AUC-ROC (0.84), followed by GBM (76%) and XGBoost. Tumor and normal tissues were classified with relatively high sensitivity and specificity, whereas OPLs were difficult to detect, showing low sensitivity (<50%) across all models. GBM performed best for normal tissue detection, and Naive Bayes achieved a slightly higher tumor F1-score, but RF offered the most balanced performance across classes.
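The per-class sensitivity and specificity reported in the Results can be computed from a multi-class confusion matrix in a one-vs-rest fashion. The sketch below assumes that convention; the function name and the confusion matrix are made up for illustration and are not results from the study.

```python
def per_class_sensitivity_specificity(confusion, labels):
    """One-vs-rest sensitivity and specificity for each class.

    confusion[i][j] = number of samples with true class i predicted as class j.
    """
    total = sum(sum(row) for row in confusion)
    metrics = {}
    for i, label in enumerate(labels):
        tp = confusion[i][i]                          # correctly predicted as class i
        fn = sum(confusion[i]) - tp                   # class i predicted as another class
        fp = sum(row[i] for row in confusion) - tp    # other classes predicted as class i
        tn = total - tp - fn - fp                     # everything else
        metrics[label] = {
            "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
            "specificity": tn / (tn + fp) if tn + fp else float("nan"),
        }
    return metrics


# Hypothetical counts chosen only to illustrate the calculation.
labels = ["normal", "OPL", "tumor"]
toy_confusion = [
    [8, 1, 1],  # true normal
    [3, 4, 3],  # true OPL
    [1, 1, 8],  # true tumor
]

results = per_class_sensitivity_specificity(toy_confusion, labels)
for label in labels:
    print(label, results[label])
```

In this toy matrix the OPL class has a sensitivity of 4/10 while its specificity stays high, which mirrors the kind of pattern the abstract describes: a class can be rarely over-predicted yet still frequently missed.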
Conclusions: Ensemble ML models, particularly RF and GBM, demonstrate proof-of-concept potential for DNA methylation-based classification of oral tissues. While tumor and normal classification is robust, OPL detection remains challenging, highlighting the need for larger, balanced datasets and complementary biomarkers to improve early detection and clinical utility.
About the journal:
J Stomatol Oral Maxillofac Surg publishes research papers and techniques - (guest) editorials, original articles, reviews, technical notes, case reports, images, letters to the editor, guidelines - dedicated to enhancing surgical expertise in all fields relevant to oral and maxillofacial surgery: from plastic and reconstructive surgery of the face, oral surgery and medicine, … to dentofacial and maxillofacial orthopedics.
Original articles include clinical or laboratory investigations and clinical or equipment reports. Reviews include narrative reviews, systematic reviews and meta-analyses.
All manuscripts submitted to the journal are subjected to peer review by international experts, and must:
Be written in excellent English, clear and easy to understand, precise and concise;
Bring new, interesting, valid information - and improve clinical care or guide future research;
Be solely the work of the author(s) stated;
Not have been previously published elsewhere and not be under consideration by another journal;
Be in accordance with the journal's Guide for Authors' instructions: manuscripts that fail to comply with these rules may be returned to the authors without being reviewed.
Under no circumstances does the journal guarantee publication before the editorial board makes its final decision.
The journal is indexed in the main international databases and is accessible worldwide through the ScienceDirect and ClinicalKey Platforms.