Natural Language Processing to Extract Head and Neck Cancer Data From Unstructured Electronic Health Records

IF 3.2; Zone 3 (Medicine); JCR Q2 (Oncology)

Clinical Oncology. Pub Date: 2025-03-20. DOI: 10.1016/j.clon.2025.103805

T. Young, J. Au Yeung, K. Sambasivan, D. Adjogatse, A. Kong, I. Petkar, M. Reis Ferreira, M. Lei, A. King, J. Teo, T. Guerrero Urbano

Clinical Oncology, volume 41, Article 103805. Full text: https://www.sciencedirect.com/science/article/pii/S0936655525000603

Citations: 0

Abstract

Aims

Patient data are frequently stored in unstructured form within Electronic Health Records (EHRs) and therefore require manual curation. AI tools using Natural Language Processing (NLP) may rapidly and accurately curate real-world unstructured EHR data to enrich datasets. We evaluated this approach for Head and Neck Cancer (HNC) patient data extraction using an open-source, general-purpose healthcare NLP tool (CogStack).

Materials and Methods

CogStack was applied to extract relevant SNOMED-CT concepts from HNC patients' documents, generating outputs that recorded each identification of each concept for each patient. Outputs were compared with manually curated ground-truth HNC datasets to calculate pre-training performance. Supervised model training was then performed using SNOMED-CT concept annotation on clinical documents, and the updated model was re-evaluated. A second training cycle was performed before the final evaluation. A thresholding approach (multiple detections needed to qualify a concept as 'present') was used to increase precision. The final model was evaluated on an unseen test cohort. The F1 score (harmonic mean of precision and recall) was used for evaluation.
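The thresholding rule and the F1 score described above can be illustrated with a minimal sketch. This is not CogStack code: the function names, the list-of-concept-IDs input format, and the default threshold of 2 are assumptions made for illustration only.

```python
from collections import Counter


def concept_present(detections, concept, threshold=2):
    """Return True if `concept` was detected at least `threshold` times.

    `detections` is a hypothetical flat list of concept IDs found across
    all of one patient's documents.
    """
    return Counter(detections)[concept] >= threshold


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall.

    Returns None when precision or recall is undefined, or when both are
    zero (an assumed convention for an 'incalculable' score).
    """
    if tp + fp == 0 or tp + fn == 0:
        return None
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    if precision + recall == 0:
        return None
    return 2 * precision * recall / (precision + recall)
```

Raising the threshold removes concepts that were detected only once, which tends to reduce false positives (higher precision) at the cost of missing some true mentions (lower recall).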

Results

Before training, the F1 score was incalculable for 19.5% of concepts due to insufficient recall. Following one training cycle, the F1 score became calculable for all concepts (median 0.692). After further training, the final model demonstrated improvement in the median F1 score (0.708). The test cohort median F1 score was 0.750. Thresholding analysis developed a concept-specific best threshold approach, resulting in a median F1 score of 0.778 in the test cohort, where 50 out of 109 SNOMED-CT concepts met pre-set criteria to be considered adequately fine-tuned.
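One way to picture a concept-specific best threshold is to try several detection-count thresholds per concept and keep the one with the highest F1 against the ground truth. The sketch below rests on assumptions: the abstract does not state how the best threshold was chosen, and the data layout (per-patient detection counts and boolean ground truth), the candidate thresholds, and the tie-breaking rule are all hypothetical.

```python
def f1_from_counts(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN); None when there are no true positives
    (assumed convention for an incalculable score)."""
    if tp == 0:
        return None
    return 2 * tp / (2 * tp + fp + fn)


def best_threshold(counts, truth, thresholds=(1, 2, 3)):
    """Pick the threshold with the highest F1 for a single concept.

    counts: dict mapping patient ID -> number of detections of the concept
    truth:  dict mapping patient ID -> True if the concept is truly present
    Ties go to the lowest threshold. Returns (threshold, f1), or
    (None, None) if F1 is incalculable at every threshold.
    """
    best = (None, None)
    for t in thresholds:
        tp = fp = fn = 0
        for patient, is_present in truth.items():
            predicted = counts.get(patient, 0) >= t
            if predicted and is_present:
                tp += 1
            elif predicted and not is_present:
                fp += 1
            elif is_present:
                fn += 1
        score = f1_from_counts(tp, fp, fn)
        if score is not None and (best[1] is None or score > best[1]):
            best = (t, score)
    return best


# Toy data for one concept (invented for illustration, not from the study).
counts = {"p1": 3, "p2": 1, "p3": 0, "p4": 2}
truth = {"p1": True, "p2": False, "p3": False, "p4": True}
threshold, score = best_threshold(counts, truth)
```

Repeating this per concept and taking the median of the resulting F1 scores gives a summary figure of the kind reported above. In practice the thresholds would be chosen on training or validation data and only then applied to the unseen test cohort.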

Conclusions

NLP can mine unstructured cancer data following limited training. Certain concepts, such as histopathology terms, remained poorly retrieved. Model performance was maintained when applied to a test cohort, demonstrating good generalisability. A concept-specific thresholding strategy improved performance. Fine-tuning annotations were incorporated into the NLP parent model to benefit future performance. CogStack has been applied to extract data for 50 concepts with validated performance for our entire retrospective HNC cohort.


Source journal

Clinical Oncology (Medicine, Oncology)

CiteScore: 5.20

Self-citation rate: 8.80%

Articles published: 332

Review time: 40 days

About the journal: Clinical Oncology is an international cancer journal covering all aspects of the clinical management of cancer patients, reflecting a multidisciplinary approach to therapy. Papers, editorials and reviews are published on all types of malignant disease, embracing pathology, diagnosis and treatment, including radiotherapy, chemotherapy, surgery, combined modality treatment and palliative care. Research and review papers covering epidemiology, radiobiology, radiation physics, tumour biology and immunology are also published, together with letters to the editor, case reports and book reviews.