Development and validation of machine learning models based on molecular features for estimating the probability of multiple primary lung carcinoma versus intrapulmonary metastasis in patients presenting multiple non-small cell lung cancers.

IF 3.5, Zone 2 (Medicine), JCR Q2 (Oncology)
Translational lung cancer research Pub Date : 2025-04-30 Epub Date: 2025-04-25 DOI:10.21037/tlcr-24-875
Ning Liu, Xue Li, Xu Luo, Bin Liu, Jie Tang, Fei Xiao, Weiya Wang, Yuan Tang, Pei Shu, Benxia Zhang, Yue Chen, Diyuan Qin, Qizhi Ma, Fuchun Guo, Xiaojun Tang, Daxing Zhu, Jiandong Mei, Weizhi Chen, Dan Li, Lili Jiang, Yongsheng Wang
Volume 14, Issue 4, pp. 1118-1137. Publication type: Journal Article. Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12082235/pdf/
Citations: 0

Abstract

Background: Classifying multiple non-small cell lung cancers (NSCLCs) as multiple primary lung cancers (MPLCs) or intrapulmonary metastases (IPMs) is critical but remains challenging. This study aimed to develop and validate machine learning (ML) models based on molecular features to estimate the probability of MPLC versus IPM in patients presenting with multiple NSCLCs.

Methods: A total of 72 patients with multiple NSCLCs and 157 surgically resected tumor lesions from two institutions (January 2012 to January 2018) were included for model development and testing. Specifically, 46 patients with 103 tumors classified as definitive MPLC or IPM according to International Association for the Study of Lung Cancer (IASLC) criteria were used to develop the models. These were split into training and validation sets using stratified random sampling and five-fold cross-validation. The developed models were then tested in the remaining 26 patients, whose tumors could not be classified by traditional methods. Whole-exome sequencing (WES) was performed on all included tumor samples. Four molecular features were calculated to characterize tumor relatedness and served as model inputs: genetic divergence, shared mutation number, Pearson correlation coefficient, and early mutation number. Decision tree (DT), random forest (RF), and gradient boosting decision tree (GBDT) models were employed, with performance assessed by area under the curve (AUC), accuracy, precision, recall, and F1 score in the validation set. Disease-free survival (DFS) was used to evaluate model performance in the test cohort. Clinical and genetic characteristics were then compared between the MPLC and IPM populations.
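To make the idea of pairwise tumor relatedness concrete, the following is a minimal Python sketch of three of the four features for one pair of tumors. The abstract does not give the exact feature definitions, so everything here is an assumption for illustration only: each tumor is represented as a hypothetical dict of mutation identifier to variant allele frequency (VAF), "genetic divergence" is approximated as the fraction of observed mutations that are not shared, and the Pearson correlation is taken over VAFs with absent mutations counted as 0. The early mutation number is not attempted because its definition cannot be inferred from the abstract.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def relatedness_features(tumor_a, tumor_b):
    """Illustrative relatedness features for two tumors.

    tumor_a, tumor_b: dicts mapping a mutation identifier to its VAF.
    The definitions below are assumptions, not the study's own formulas.
    """
    shared = set(tumor_a) & set(tumor_b)
    union = sorted(set(tumor_a) | set(tumor_b))
    if not union:
        raise ValueError("at least one mutation is required")
    # VAF vectors over the union of mutations; an absent mutation counts as 0.
    vaf_a = [tumor_a.get(m, 0.0) for m in union]
    vaf_b = [tumor_b.get(m, 0.0) for m in union]
    return {
        # Mutations detected in both tumors.
        "shared_mutation_number": len(shared),
        # Assumed proxy: share of observed mutations private to one tumor.
        "genetic_divergence": 1.0 - len(shared) / len(union),
        "pearson_correlation": pearson(vaf_a, vaf_b),
    }


# Made-up toy inputs (placeholder mutation names, not study data).
related = relatedness_features(
    {"mut1": 0.40, "mut2": 0.30, "mut3": 0.10},
    {"mut1": 0.35, "mut2": 0.25},
)
unrelated = relatedness_features(
    {"mut1": 0.40, "mut2": 0.30},
    {"mut4": 0.50, "mut5": 0.20},
)
print(related)
print(unrelated)
```

Under these assumed definitions, a pair sharing most mutations yields low divergence and a high correlation, while a pair with no shared mutations yields a divergence of 1 and a negative correlation.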

Results: All four molecular features differed significantly between MPLC and IPM patients in the development cohort: MPLC exhibited higher genetic divergence and lower shared mutation number, Pearson correlation coefficient, and early mutation number than IPM (P<0.001). The DT, RF, and GBDT models developed with these features achieved mean AUCs of 0.94 [standard deviation (SD) 0.09], 1.00 (SD 0.00), and 1.00 (SD 0.00) in the validation set, respectively. All three models consistently classified the undetermined multiple NSCLCs as MPLC (n=15) or IPM (n=11). Patients with MPLC identified by the ML models had significantly longer DFS than those with IPM [hazard ratio =0.21; 95% confidence interval (CI): 0.04-1.0; P=0.04]. MPLC patients had a relatively higher prevalence of a family history of cancer in first-degree relatives, and more than half of these patients reported a family history of lung cancer. EGFR remained the most commonly mutated driver in both the MPLC and IPM populations.
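The mean AUC and SD figures above come from five-fold cross-validation. The sketch below shows, in plain Python, how such a summary could be produced: a stratified split into five folds, a pairwise-comparison AUC, and the mean and SD across folds. It is an illustration under stated assumptions, not the study's pipeline: `centroid_fit` is a hypothetical toy scorer standing in for the DT, RF, and GBDT models, the data are synthetic, the 0/1 label coding is arbitrary, and the use of the sample SD is a guess since the abstract does not say which SD was reported.

```python
import random
from statistics import mean, stdev


def stratified_folds(labels, k=5, seed=0):
    """Split sample indices into k folds with similar class proportions."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal this class's indices round-robin across the folds.
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


def auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs both classes")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def centroid_fit(train_x, train_y):
    """Hypothetical toy model: score rises as x nears the class-1 mean."""
    m1 = mean(x for x, y in zip(train_x, train_y) if y == 1)
    m0 = mean(x for x, y in zip(train_x, train_y) if y == 0)
    return lambda x: abs(x - m0) - abs(x - m1)


def cross_validated_auc(features, labels, fit, k=5, seed=0):
    """Per-fold AUCs; fit(train_x, train_y) must return a scoring function."""
    aucs = []
    for fold in stratified_folds(labels, k, seed):
        held_out = set(fold)
        train = [i for i in range(len(labels)) if i not in held_out]
        scorer = fit([features[i] for i in train], [labels[i] for i in train])
        aucs.append(auc([labels[i] for i in fold],
                        [scorer(features[i]) for i in fold]))
    return aucs


# Synthetic, perfectly separable single-feature data (not from the study).
toy_x = [0.05 * i for i in range(10)] + [0.6 + 0.04 * i for i in range(10)]
toy_y = [0] * 10 + [1] * 10
fold_aucs = cross_validated_auc(toy_x, toy_y, centroid_fit)
# Sample SD across folds (an assumption; population SD is also plausible).
print(f"mean AUC {mean(fold_aucs):.2f} (SD {stdev(fold_aucs):.2f})")
```

With only a handful of held-out samples per fold, per-fold AUCs are coarse-grained, which is worth keeping in mind when reading a mean AUC of 1.00 with SD 0.00 on a small development cohort.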

Conclusions: ML models based on molecular features effectively discriminate primary tumors from metastases in multiple NSCLCs, improving diagnostic accuracy and assisting clinical decision-making, particularly in challenging cases.

Source journal
CiteScore: 7.20
Self-citation rate: 2.50%
Articles published: 137
Journal description: Translational Lung Cancer Research (TLCR, Transl Lung Cancer Res, Print ISSN 2218-6751; Online ISSN 2226-4477) is an international, peer-reviewed, open-access journal founded in March 2012. TLCR is indexed by PubMed/PubMed Central and the Chemical Abstracts Service (CAS) Databases. It was published quarterly in its first year and has been published bimonthly since February 2013. It provides practical, up-to-date information on the prevention, early detection, diagnosis, and treatment of lung cancer. Specific areas of interest include, but are not limited to, multimodality therapy, markers, imaging, tumor biology, pathology, chemoprevention, and technical advances related to lung cancer.