Kyeryoung Lee, Zongzhi Liu, Qing Huang, David Corrigan, Iftekhar Kalsekar, Tomi Jun, Gustavo Stolovitzky, William K Oh, Ravi Rajaram, Xiaoyan Wang
{"title":"解码早期和局部区域晚期非小细胞肺癌的复发:来自电子健康记录和自然语言处理的见解。","authors":"Kyeryoung Lee, Zongzhi Liu, Qing Huang, David Corrigan, Iftekhar Kalsekar, Tomi Jun, Gustavo Stolovitzky, William K Oh, Ravi Rajaram, Xiaoyan Wang","doi":"10.1200/CCI-24-00227","DOIUrl":null,"url":null,"abstract":"<p><strong>Purpose: </strong>Recurrences after curative resection in early-stage and locoregionally advanced non-small cell lung cancer (NSCLC) are common, necessitating a nuanced understanding of associated risk factors. This study aimed to establish a natural language processing (NLP) system to efficiently curate recurrence data in NSCLC and analyze risk factors longitudinally.</p><p><strong>Patients and methods: </strong>Electronic health records of 6,351 patients with NSCLC with >700,000 notes were obtained from Mount Sinai's data sets. A deep learning-based customized NLP system was developed to identify cohorts experiencing recurrence. Recurrence types and rates over time were stratified by various clinical features. Cohort description analysis, Kaplan-Meier analysis for overall recurrence-free survival (RFS) and distant metastasis-free survival (DMFS), and Cox proportional hazards analysis were performed.</p><p><strong>Results: </strong>Of 1,295 patients with stage I-IIIA NSCLC with surgical resections, 336 patients (25.9%) experienced recurrence, as identified through NLP. The NLP system achieved a precision of 94.3%, a recall of 93%, and an F1 score of 93.5. Among 336 patients, 52.4% had local/regional recurrences, 44% distant metastases, and 3.6% unknown recurrence. RFS rates at years 1-5 were 93%, 81%, 73%, 67%, and 61%, respectively (96%, 89%, 84%, 80%, and 75% for distant metastasis). Stage-specific RFS rates at year 5 were 73% (IA), 62% (IB), 47% (IIA), 46% (IIB), and 20% (IIIA). Stage IB patients had a significantly higher likelihood of recurrence versus stage IA (adjusted hazard ratio [aHR], 1.63; <i>P</i> = .02). The RFS was lower in patients with clinically significant <i>TP53</i> alteration (<i>v</i> <i>TP53</i>-negative or unknown significance), affecting overall RFS (aHR, 1.89; <i>P</i> = .007) and DMFS (aHR, 2.47; <i>P</i> = .009) among stage IA/IB patients.</p><p><strong>Conclusion: </strong>Our scalable NLP system enabled us to generate real-world insights into NSCLC recurrences, paving the way for predictive models for preventing, diagnosing, and treating NSCLC recurrence.</p>","PeriodicalId":51626,"journal":{"name":"JCO Clinical Cancer Informatics","volume":"9 ","pages":"e2400227"},"PeriodicalIF":3.3000,"publicationDate":"2025-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12011440/pdf/","citationCount":"0","resultStr":"{\"title\":\"Decoding Recurrence in Early-Stage and Locoregionally Advanced Non-Small Cell Lung Cancer: Insights From Electronic Health Records and Natural Language Processing.\",\"authors\":\"Kyeryoung Lee, Zongzhi Liu, Qing Huang, David Corrigan, Iftekhar Kalsekar, Tomi Jun, Gustavo Stolovitzky, William K Oh, Ravi Rajaram, Xiaoyan Wang\",\"doi\":\"10.1200/CCI-24-00227\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Purpose: </strong>Recurrences after curative resection in early-stage and locoregionally advanced non-small cell lung cancer (NSCLC) are common, necessitating a nuanced understanding of associated risk factors. This study aimed to establish a natural language processing (NLP) system to efficiently curate recurrence data in NSCLC and analyze risk factors longitudinally.</p><p><strong>Patients and methods: </strong>Electronic health records of 6,351 patients with NSCLC with >700,000 notes were obtained from Mount Sinai's data sets. A deep learning-based customized NLP system was developed to identify cohorts experiencing recurrence. Recurrence types and rates over time were stratified by various clinical features. Cohort description analysis, Kaplan-Meier analysis for overall recurrence-free survival (RFS) and distant metastasis-free survival (DMFS), and Cox proportional hazards analysis were performed.</p><p><strong>Results: </strong>Of 1,295 patients with stage I-IIIA NSCLC with surgical resections, 336 patients (25.9%) experienced recurrence, as identified through NLP. The NLP system achieved a precision of 94.3%, a recall of 93%, and an F1 score of 93.5. Among 336 patients, 52.4% had local/regional recurrences, 44% distant metastases, and 3.6% unknown recurrence. RFS rates at years 1-5 were 93%, 81%, 73%, 67%, and 61%, respectively (96%, 89%, 84%, 80%, and 75% for distant metastasis). Stage-specific RFS rates at year 5 were 73% (IA), 62% (IB), 47% (IIA), 46% (IIB), and 20% (IIIA). Stage IB patients had a significantly higher likelihood of recurrence versus stage IA (adjusted hazard ratio [aHR], 1.63; <i>P</i> = .02). The RFS was lower in patients with clinically significant <i>TP53</i> alteration (<i>v</i> <i>TP53</i>-negative or unknown significance), affecting overall RFS (aHR, 1.89; <i>P</i> = .007) and DMFS (aHR, 2.47; <i>P</i> = .009) among stage IA/IB patients.</p><p><strong>Conclusion: </strong>Our scalable NLP system enabled us to generate real-world insights into NSCLC recurrences, paving the way for predictive models for preventing, diagnosing, and treating NSCLC recurrence.</p>\",\"PeriodicalId\":51626,\"journal\":{\"name\":\"JCO Clinical Cancer Informatics\",\"volume\":\"9 \",\"pages\":\"e2400227\"},\"PeriodicalIF\":3.3000,\"publicationDate\":\"2025-04-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12011440/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"JCO Clinical Cancer Informatics\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1200/CCI-24-00227\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"2025/4/18 0:00:00\",\"PubModel\":\"Epub\",\"JCR\":\"Q2\",\"JCRName\":\"ONCOLOGY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"JCO Clinical Cancer Informatics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1200/CCI-24-00227","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/4/18 0:00:00","PubModel":"Epub","JCR":"Q2","JCRName":"ONCOLOGY","Score":null,"Total":0}
Decoding Recurrence in Early-Stage and Locoregionally Advanced Non-Small Cell Lung Cancer: Insights From Electronic Health Records and Natural Language Processing.
Purpose: Recurrences after curative resection in early-stage and locoregionally advanced non-small cell lung cancer (NSCLC) are common, necessitating a nuanced understanding of associated risk factors. This study aimed to establish a natural language processing (NLP) system to efficiently curate recurrence data in NSCLC and analyze risk factors longitudinally.
Patients and methods: Electronic health records of 6,351 patients with NSCLC with >700,000 notes were obtained from Mount Sinai's data sets. A deep learning-based customized NLP system was developed to identify cohorts experiencing recurrence. Recurrence types and rates over time were stratified by various clinical features. Cohort description analysis, Kaplan-Meier analysis for overall recurrence-free survival (RFS) and distant metastasis-free survival (DMFS), and Cox proportional hazards analysis were performed.
Results: Of 1,295 patients with stage I-IIIA NSCLC with surgical resections, 336 patients (25.9%) experienced recurrence, as identified through NLP. The NLP system achieved a precision of 94.3%, a recall of 93%, and an F1 score of 93.5. Among 336 patients, 52.4% had local/regional recurrences, 44% distant metastases, and 3.6% unknown recurrence. RFS rates at years 1-5 were 93%, 81%, 73%, 67%, and 61%, respectively (96%, 89%, 84%, 80%, and 75% for distant metastasis). Stage-specific RFS rates at year 5 were 73% (IA), 62% (IB), 47% (IIA), 46% (IIB), and 20% (IIIA). Stage IB patients had a significantly higher likelihood of recurrence versus stage IA (adjusted hazard ratio [aHR], 1.63; P = .02). The RFS was lower in patients with clinically significant TP53 alteration (vTP53-negative or unknown significance), affecting overall RFS (aHR, 1.89; P = .007) and DMFS (aHR, 2.47; P = .009) among stage IA/IB patients.
Conclusion: Our scalable NLP system enabled us to generate real-world insights into NSCLC recurrences, paving the way for predictive models for preventing, diagnosing, and treating NSCLC recurrence.