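Extraction and Imputation of Eastern Cooperative Oncology Group Performance Status From Unstructured Oncology Notes Using Language Models.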

Impact factor: 3.3; JCR quartile: Q2 (Oncology)

JCO Clinical Cancer Informatics, vol. 8, e2300269. Pub Date: 2024-05-01. DOI: 10.1200/CCI.23.00269

Wenxin Xu, Bowen Gu, William E Lotter, Kenneth L Kehl


Citations: 0

Abstract



Purpose: Eastern Cooperative Oncology Group (ECOG) performance status (PS) is a key clinical variable for cancer treatment and research, but it is usually only recorded in unstructured form in the electronic health record. We investigated whether natural language processing (NLP) models can impute ECOG PS using unstructured note text.

Materials and methods: Medical oncology notes were identified from all patients with cancer at our center from 1997 to 2023 and divided at the patient level into training (approximately 80%), tuning/validation (approximately 10%), and test (approximately 10%) sets. Regular expressions were used to extract explicitly documented PS. Extracted PS labels were used to train NLP models to impute ECOG PS (0-1 v 2-4) from the remainder of the notes (with regular expression-extracted PS documentation removed). We assessed associations between imputed PS and overall survival (OS).
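The labeling steps described above can be sketched roughly as follows. This is a minimal illustration, not the study's pipeline: the abstract does not give the actual regular expressions, so `ECOG_PATTERN`, the `[PS_REMOVED]` placeholder, and the helper names `extract_and_mask` and `assign_split` are all assumptions made for the example. The hash-based split is one simple way to keep every note from a patient in the same partition with roughly 80/10/10 proportions.

```python
import hashlib
import re

# Hypothetical pattern: the study's real regular expressions are not given in the abstract.
ECOG_PATTERN = re.compile(
    r"\b(?:ECOG|performance\s+status)\b[^0-9\n]{0,30}?([0-4])\b",
    re.IGNORECASE,
)


def extract_and_mask(note: str):
    """Return (binary_label, masked_note); the label is None if no PS is documented."""
    match = ECOG_PATTERN.search(note)
    if match is None:
        return None, note
    score = int(match.group(1))
    label = 1 if score >= 2 else 0  # 1 = poor PS (2-4), 0 = good PS (0-1)
    # Remove the explicit PS mention so a model must rely on the rest of the note.
    masked = ECOG_PATTERN.sub("[PS_REMOVED]", note)
    return label, masked


def assign_split(patient_id: str) -> str:
    """Deterministically place a patient (and all of their notes) into one split."""
    bucket = int(hashlib.sha256(patient_id.encode("utf-8")).hexdigest(), 16) % 10
    if bucket < 8:
        return "train"  # roughly 80% of patients
    if bucket == 8:
        return "validation"  # roughly 10%
    return "test"  # roughly 10%


if __name__ == "__main__":
    example = "Patient seen today. ECOG PS: 2. Fatigue limits daily activity."
    print(extract_and_mask(example))
    print(assign_split("patient-0001"))
```

Splitting on the patient identifier rather than on individual notes avoids leaking a patient's writing style or history from training into the test set, which is the point of dividing "at the patient level".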

Results: ECOG PS was extracted using regular expressions from 495,862 notes, corresponding to 79,698 patients. A Transformer-based Longformer model imputed PS with high discrimination (test set area under the receiver operating characteristic curve 0.95, area under the precision-recall curve 0.73). Imputed poor PS was associated with worse OS, including among notes with no explicit documentation of PS detected (OS hazard ratio, 11.9; 95% CI, 11.1 to 12.8).
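For readers unfamiliar with the two discrimination metrics reported above, the sketch below computes them in plain Python on a tiny invented example. The labels and scores are toy values, not study data, and the abstract does not say which software the authors used; `auroc` uses the pairwise (Mann-Whitney) formulation and `average_precision` is one common estimator of the area under the precision-recall curve, assuming no tied scores.

```python
def auroc(labels, scores):
    """Probability that a random positive scores higher than a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # Count a win for each positive/negative pair ranked correctly; ties count half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values at the rank of each positive (assumes no ties)."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = 0
    total = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank  # precision at this cutoff
    return total / hits


if __name__ == "__main__":
    toy_labels = [0, 0, 1, 1]  # 1 = poor PS (2-4), invented for illustration
    toy_scores = [0.1, 0.4, 0.35, 0.8]  # model-predicted probability of poor PS
    print(auroc(toy_labels, toy_scores))
    print(average_precision(toy_labels, toy_scores))
```

Unlike the area under the receiver operating characteristic curve, the precision-recall area depends on how common the positive class is, so the two metrics can legitimately differ a good deal when poor PS is the minority class.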

Conclusion: NLP models can be used to impute performance status from unstructured oncologist notes at scale. This may aid the annotation of oncology data sets for clinical outcomes research and cancer care delivery.


Source journal: JCO Clinical Cancer Informatics (Oncology)

CiteScore: 6.20

Self-citation rate: 4.80%

Articles published: 190