Assessing the feasibility and external validity of natural language processing-extracted data for advanced lung cancer patients.

IF 4.5 | Zone 2 (Medicine) | JCR Q1, Oncology
Yuchen Li, Jennifer Law, Lisa W Le, Janice J N Li, Christopher Pettengell, Patricia Demarco, Michael Duong, David Merritt, Sean Davidson, Mike Sung, Qixuan Li, Sally Cm Lau, Sajda Zahir, Ryan Chu, Malcom Ryan, Khizar Karim, Josh Morganstein, Adrian Sacher, Lawson Eng, Frances A Shepherd, Penelope Bradbury, Geoffrey Liu, Natasha B Leighl
Lung Cancer, volume 199, article 108080. Published 2025-01-04. DOI: 10.1016/j.lungcan.2025.108080 (https://doi.org/10.1016/j.lungcan.2025.108080)
Citation count: 0

Abstract

Background: Manual extraction of real-world clinical data for research can be time-consuming and prone to error. We assessed the feasibility of using natural language processing (NLP), an AI technique, to automate data extraction for patients with advanced lung cancer (aLC). We assessed the external validity of our NLP-extracted data by comparing our findings to those reported in the literature.

Methods: Patients diagnosed with stage IIIB or IV lung cancer between January 2015 and December 2017 at Princess Margaret Cancer Centre who received at least one dose of systemic therapy were included. Their electronic health records were provided to Pentavere's NLP platform, DARWEN™, in March 2019. Descriptive statistics summarized baseline patient and cancer characteristics, molecular biomarkers, and first-line systemic therapies. Cox multivariate models were used to evaluate prognostic factors for the advanced non-small cell lung cancer (NSCLC) and small-cell lung cancer (SCLC) cohorts.

Results: NLP extracted clinical information (n = 333 patients) in a total of 8 hours, with few missing values: smoking status (n = 2) and Eastern Cooperative Oncology Group (ECOG) status (n = 5). Baseline patient and cancer characteristics summarized from NLP-extracted data were comparable to those in previous studies and population reports. For NSCLC patients, being male (HR 1.44, 95% CI [1.04, 2.00]), having worse ECOG status (1.48 [1.22, 1.81]), and having liver (2.24 [1.45, 3.46]), bone (2.09 [1.48, 2.96]), or lung metastases (2.54 [1.05, 2.26]) were associated with worse survival outcomes. For SCLC patients, older age (HR 1.70 per 10 years, 95% CI [1.10, 2.63]) and liver metastases (3.81 [1.61, 9.01]) were associated with worse survival outcomes.
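The hazard ratios above can be sanity-checked against their confidence intervals. The sketch below assumes the intervals are Wald-type and therefore symmetric on the log scale, which is the usual default for Cox models but is not stated in the abstract. Under that assumption the point estimate should sit close to the geometric mean of the two bounds. The labels, the helper names, and the 2% tolerance are illustrative choices, not part of the study.

```python
import math

# (label, hazard ratio, CI lower, CI upper), copied from the Results paragraph.
REPORTED = [
    ("NSCLC male sex", 1.44, 1.04, 2.00),
    ("NSCLC worse ECOG", 1.48, 1.22, 1.81),
    ("NSCLC liver metastases", 2.24, 1.45, 3.46),
    ("NSCLC bone metastases", 2.09, 1.48, 2.96),
    ("NSCLC lung metastases", 2.54, 1.05, 2.26),
    ("SCLC age per 10 years", 1.70, 1.10, 2.63),
    ("SCLC liver metastases", 3.81, 1.61, 9.01),
]


def implied_hr(lower, upper):
    """Point estimate implied by a log-symmetric CI: the geometric mean of its bounds."""
    return math.sqrt(lower * upper)


def inconsistent(rows, rel_tol=0.02):
    """Return rows whose HR lies outside its CI or far from the implied estimate.

    rel_tol is a hypothetical tolerance meant to absorb two-decimal rounding.
    """
    flagged = []
    for label, hr, lower, upper in rows:
        implied = implied_hr(lower, upper)
        outside = not (lower <= hr <= upper)
        off_center = abs(hr - implied) / implied > rel_tol
        if outside or off_center:
            flagged.append((label, hr, round(implied, 2)))
    return flagged


FLAGGED = inconsistent(REPORTED)
for label, hr, implied in FLAGGED:
    print(f"{label}: reported HR {hr}, CI implies about {implied}")
```

Six of the seven estimates pass this check. The lung metastases estimate does not, because the reported HR lies above the upper bound of its own interval. The check cannot say which of the three numbers is misreported, so the full text should be consulted before that estimate is reused.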

Conclusion: Our study demonstrated that automated data extraction using NLP is feasible and time-efficient. Additionally, the NLP-extracted data can be used to identify valid and useful clinical endpoints for research. NLP holds significant potential to accelerate the extraction of real-world data for future observational studies.

Source journal: Lung Cancer (Medicine: Respiratory System)
CiteScore: 9.40
Self-citation rate: 3.80%
Articles published: 407
Review time: 25 days
About the journal: Lung Cancer is an international publication covering the clinical, translational and basic science of malignancies of the lung and chest region. Original research articles, early reports, review articles, editorials and correspondence covering the prevention, epidemiology and etiology, basic biology, pathology, clinical assessment, surgery, chemotherapy, radiotherapy, combined treatment modalities, other treatment modalities and outcomes of lung cancer are welcome.