Validation of Non-Small Cell Lung Cancer Clinical Insights Using a Generalized Oncology Natural Language Processing Model.


JCO Clinical Cancer Informatics Pub Date : 2024-09-01 DOI:10.1200/CCI.23.00099

Rachel C Kenney, Xiaoren Chen, Kazuki Shintani, Clara Gagnon, John Liu, Stacey DaCosta Byfield, Lorre Ochs, Anne-Marie Currie



Abstract

Purpose: Few studies have used natural language processing (NLP) in the context of non-small cell lung cancer (NSCLC). This study aimed to validate the application of an NLP model to an NSCLC cohort by extracting NSCLC concepts from free-text medical notes and converting them to structured, interpretable data.

Methods: Patients with a lung neoplasm, NSCLC histology, and treatment information in their notes were selected from a repository of over 27 million patients. From these, 200 were randomly selected for this study with the longest and the most recent note included for each patient. An NLP model developed and validated on a large solid and blood cancer oncology cohort was applied to this NSCLC cohort. Two certified tumor registrars and a curator abstracted concepts from the notes: neoplasm, histology, stage, TNM values, and metastasis sites. This manually abstracted gold standard was compared with the NLP model output. Precision and recall scores were calculated.
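The cohort-selection step described in the Methods can be sketched as follows. This is a minimal illustration, not the study's pipeline: the `Note` record, the `select_notes` function, the fixed seed, and the reading that a patient contributes one note when the longest note is also the most recent are all assumptions made for the example.

```python
import random
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Note:
    """Hypothetical note record; field names are illustrative only."""
    patient_id: str
    written_on: date
    text: str


def select_notes(notes, n_patients, seed=0):
    """Randomly sample patients, then keep each one's longest and most recent note."""
    by_patient = {}
    for note in notes:
        by_patient.setdefault(note.patient_id, []).append(note)

    # Sort patient IDs so the sample is reproducible for a given seed.
    rng = random.Random(seed)
    patient_ids = sorted(by_patient)
    chosen = rng.sample(patient_ids, min(n_patients, len(patient_ids)))

    selected = {}
    for pid in chosen:
        patient_notes = by_patient[pid]
        longest = max(patient_notes, key=lambda n: len(n.text))
        latest = max(patient_notes, key=lambda n: n.written_on)
        # Assumption: if one note is both longest and latest, keep it once.
        selected[pid] = [longest] if longest is latest else [longest, latest]
    return selected
```

In the study's setting, `n_patients` would be 200 and `notes` would come from the eligible patients in the repository; the abstract does not describe how ties or single-note patients were handled.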

Results: The NLP model extracted the NSCLC concepts with excellent precision and recall. Precision and recall, respectively, were: lung neoplasm 100% and 100%; NSCLC histology 99% and 88%; histology correctly linked to neoplasm 98% and 79%; stage value 98.8% and 92%; stage TNM value 93% and 98%; and metastasis site 97% and 89%. High precision means the model produced few false positives, so the extracted concepts are likely accurate. High recall indicates that the model captured most of the desired concepts.
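For readers unfamiliar with the two metrics, the comparison of model output against a manually abstracted gold standard can be sketched as below. Precision is TP / (TP + FP) and recall is TP / (TP + FN). The tuple representation of a concept, the function names, and the exact-match criterion are assumptions for illustration; the abstract does not specify how matches were determined.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    tp: int  # concepts found by both the model and the abstractors
    fp: int  # concepts found only by the model
    fn: int  # concepts found only by the abstractors


def count_matches(gold, predicted):
    """Compare two sets of (patient_id, concept, value) tuples by exact match."""
    gold, predicted = set(gold), set(predicted)
    return Counts(
        tp=len(gold & predicted),
        fp=len(predicted - gold),
        fn=len(gold - predicted),
    )


def precision(c):
    """Share of extracted concepts that are correct: TP / (TP + FP)."""
    return c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0


def recall(c):
    """Share of gold-standard concepts that were extracted: TP / (TP + FN)."""
    return c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0
```

Scoring each concept type separately (neoplasm, histology, stage, TNM value, metastasis site) would yield one precision and recall pair per type, which is the shape of the results reported above.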

Conclusion: This study validates that Optum's oncology NLP model has high precision and recall with clinical real-world data and is a reliable model to support research studies and clinical trials. This validation study shows that our nonspecific solid tumor and blood cancer oncology model is generalizable to successfully extract clinical information from specific cancer cohorts.

