Victor Lee, Nicholas S Moore, Joshua Doyle, Daniel Hicks, Patrick Oh, Shari Bodofsky, Sajid Hossain, Abhijit A Patel, Sanjay Aneja, Robert Homer, Henry S Park
JCO Clinical Cancer Informatics, volume 9, e2400303. Published 2025-06-01 (Epub 2025-05-30). DOI: 10.1200/CCI-24-00303 (https://doi.org/10.1200/CCI-24-00303)
Prediction of Lymph Node Metastasis in Non-Small Cell Lung Carcinoma Using Primary Tumor Somatic Mutation Data.
Purpose: Lymph node metastasis (LNM) significantly affects prognosis and treatment strategies in non-small cell lung cancer (NSCLC). Current diagnostic methods, including imaging and histopathology, have limited sensitivity and specificity. This study aims to develop and evaluate machine learning (ML) models that predict LNM in NSCLC using single-nucleotide polymorphism (SNP) data from The Cancer Genome Atlas.
Methods: A cohort of 542 patients with NSCLC and comprehensive SNP data was analyzed. After preprocessing, feature selection was performed using chi-square tests to identify SNPs significantly associated with LNM. Twelve ML models, including Logistic Regression, Naive Bayes, and Support Vector Machines, were trained and evaluated using bootstrapped data sets. Model performance was assessed using metrics such as accuracy, area under the receiver operating characteristic curve (AUC), and F1 score. Shapley additive explanations (SHAP) values were used for feature interpretability, and survival analysis was conducted to assess clinical outcomes.
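The chi-square feature selection step can be illustrated with a minimal sketch. It assumes each SNP is encoded as a binary feature (1 = mutated, 0 = wild type) and LNM as a binary label; the abstract does not state the encoding, the significance threshold, or whether a continuity correction was applied, so the function names, the `alpha=0.05` cutoff, and the toy gene names `gene_A` and `gene_B` are hypothetical.

```python
import math

def chi_square_2x2(feature, label):
    """Pearson chi-square test (1 degree of freedom, no continuity
    correction) for two binary sequences. Returns (statistic, p_value)."""
    n = len(feature)
    # Observed counts of the 2x2 contingency table.
    a = sum(1 for f, y in zip(feature, label) if f == 1 and y == 1)
    b = sum(1 for f, y in zip(feature, label) if f == 1 and y == 0)
    c = sum(1 for f, y in zip(feature, label) if f == 0 and y == 1)
    d = n - a - b - c
    row1, row0 = a + b, c + d
    col1, col0 = a + c, b + d
    if 0 in (row1, row0, col1, col0):
        # Constant feature or label: no evidence of association.
        return 0.0, 1.0
    stat = n * (a * d - b * c) ** 2 / (row1 * row0 * col1 * col0)
    # Chi-square survival function for 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

def select_features(X, y, names, alpha=0.05):
    """Keep features whose chi-square p-value is below alpha
    (alpha is an assumed threshold), sorted by p-value."""
    selected = []
    for j, name in enumerate(names):
        column = [row[j] for row in X]
        stat, p_value = chi_square_2x2(column, y)
        if p_value < alpha:
            selected.append((name, stat, p_value))
    return sorted(selected, key=lambda item: item[2])

# Toy data: 8 hypothetical patients, 2 hypothetical binary mutation features.
y = [1, 1, 1, 1, 0, 0, 0, 0]               # 1 = LNM present
X = [[1, 1], [1, 0], [1, 1], [1, 0],
     [0, 1], [0, 0], [0, 1], [0, 0]]        # columns: gene_A, gene_B
hits = select_features(X, y, ["gene_A", "gene_B"])
print([name for name, _, _ in hits])        # ['gene_A']
```

In a resampling workflow, running this kind of filter inside each training resample rather than once on the full cohort keeps the held-out data from influencing which features are chosen.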
Results: Naive Bayes and Logistic Regression models achieved the highest predictive performance, with median AUCs of 0.93 and 0.91, respectively. Key SNPs, including mutations in TANC2, KCNT2, and CENPF, were consistently identified as predictive features. Survival analysis demonstrated significant differences in outcomes on the basis of model-predicted LNM status (log-rank P = .0268). Feature selection improved model accuracy and robustness, highlighting the biological relevance of selected SNPs.
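A median AUC over bootstrapped data sets, as reported above, can be computed along the following lines. This is a simplified sketch under stated assumptions: it resamples a fixed set of predicted scores with replacement, whereas the study trained and evaluated models on bootstrapped data sets and does not describe the exact procedure. The function names, `n_boot`, and `seed` are hypothetical.

```python
import random
import statistics

def roc_auc(labels, scores):
    """AUC as the probability that a randomly chosen positive case
    receives a higher score than a randomly chosen negative case
    (ties count as 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def bootstrap_median_auc(labels, scores, n_boot=200, seed=0):
    """Median AUC over n_boot resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(labels)
    aucs = []
    while len(aucs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # resample contains a single class; draw again
        aucs.append(roc_auc(ys, [scores[i] for i in idx]))
    return statistics.median(aucs)

# Toy example with made-up scores (not data from the study).
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))  # 0.75
```

Reporting the median rather than a single-split AUC makes the summary less sensitive to any one favorable or unfavorable resample.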
Conclusion: ML models leveraging primary tumor SNP data can enhance LNM prediction in NSCLC, outperforming traditional diagnostic methods. These findings underscore the potential of integrating genomics and ML to develop noninvasive biomarkers, enabling precise risk stratification and personalized treatment strategies in oncology.