Natural Language Processing to Extract Head and Neck Cancer Data From Unstructured Electronic Health Records
T. Young, J. Au Yeung, K. Sambasivan, D. Adjogatse, A. Kong, I. Petkar, M. Reis Ferreira, M. Lei, A. King, J. Teo, T. Guerrero Urbano
Clinical Oncology, volume 41, Article 103805. Published 2025-03-20. DOI: 10.1016/j.clon.2025.103805
Journal impact factor: 3.2 (JCR Q2, Oncology)
https://www.sciencedirect.com/science/article/pii/S0936655525000603
Citation count: 0
Abstract
Aims
Patient data is frequently stored as unstructured text within Electronic Health Records (EHRs), so building datasets from it requires manual curation. AI tools using Natural Language Processing (NLP) may rapidly and accurately curate real-world unstructured EHR data to enrich datasets. We evaluated this approach for Head and Neck Cancer (HNC) patient data extraction using an open-source general-purpose healthcare NLP tool (CogStack).
Materials and Methods
CogStack was applied to extract relevant SNOMED-CT concepts from HNC patients' documents, generating outputs that recorded each detection of each concept for each patient. Outputs were compared with manually curated ground-truth HNC datasets to calculate pre-training performance. Supervised model training was then performed using SNOMED-CT concept annotation on clinical documents, and the updated model was re-evaluated. A second training cycle was performed before the final evaluation. A thresholding approach (multiple detections needed to qualify a concept as ‘present’) was used to increase precision. The final model was evaluated on an unseen test cohort. The F1 score (harmonic mean of precision and recall) was used for evaluation.
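The thresholding rule and the F1 calculation described above can be sketched as follows. This is a minimal illustration and not the authors' code or the CogStack API: the function name `f1_for_concept`, the per-patient dictionaries, and the choice to return `None` when there are no true positives are all assumptions made for the example.

```python
from typing import Optional


def f1_for_concept(
    detection_counts: dict[str, int],
    ground_truth: dict[str, bool],
    threshold: int = 1,
) -> Optional[float]:
    """F1 score for a single concept across patients.

    detection_counts: hypothetical map of patient ID to the number of times
        the NLP tool detected the concept in that patient's documents.
    ground_truth: hypothetical map of patient ID to whether the manually
        curated dataset records the concept as present.
    threshold: minimum number of detections needed to call the concept
        'present' for a patient.
    """
    tp = fp = fn = 0
    for patient_id, truly_present in ground_truth.items():
        predicted = detection_counts.get(patient_id, 0) >= threshold
        if predicted and truly_present:
            tp += 1
        elif predicted:
            fp += 1
        elif truly_present:
            fn += 1

    if tp == 0:
        # Assumption: with no true positives the harmonic mean is 0/0 or has
        # an undefined term, so it is reported as incalculable, not as 0.
        return None

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data, invented for illustration only.
example_counts = {"p1": 3, "p2": 1, "p3": 0, "p4": 2}
example_truth = {"p1": True, "p2": False, "p3": True, "p4": True}

for t in (1, 2, 4):
    print(t, f1_for_concept(example_counts, example_truth, threshold=t))
```

In the toy data, raising the threshold from 1 to 2 discards the single spurious detection for `p2`, which raises precision without losing any true positives. Raising it further removes every positive call and leaves the score incalculable.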
Results
Before training, the F1 score was incalculable for 19.5% of concepts due to insufficient recall. Following one training cycle, the F1 score became calculable for all concepts (median 0.692). After further training, the final model showed an improved median F1 score (0.708). The test cohort median F1 score was 0.750. Thresholding analysis yielded a concept-specific best-threshold approach, resulting in a median F1 score of 0.778 in the test cohort, where 50 out of 109 SNOMED-CT concepts met pre-set criteria to be considered adequately fine-tuned.
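A concept-specific best threshold can be illustrated by sweeping candidate thresholds for each concept, keeping the one with the highest F1, and then taking the median over concepts. The sketch below is hypothetical: the abstract does not state the candidate range, the tie-breaking rule, or which cohort the thresholds were selected on, so the names `best_threshold` and `median_f1`, the candidates 1 to 5, and the lowest-threshold tie-break are assumptions.

```python
from statistics import median
from typing import Optional


def f1_from_counts(tp: int, fp: int, fn: int) -> Optional[float]:
    """F1 as 2TP / (2TP + FP + FN); None when there are no true positives."""
    if tp == 0:
        return None
    return 2 * tp / (2 * tp + fp + fn)


def best_threshold(counts, truth, candidates=(1, 2, 3, 4, 5)):
    """Return (threshold, f1) with the highest F1 for one concept.

    Ties keep the lowest threshold. Returns (None, None) if F1 is
    incalculable at every candidate threshold.
    """
    best_t, best_f1 = None, None
    for t in candidates:
        tp = sum(1 for p, y in truth.items() if y and counts.get(p, 0) >= t)
        fp = sum(1 for p, y in truth.items() if not y and counts.get(p, 0) >= t)
        fn = sum(1 for p, y in truth.items() if y and counts.get(p, 0) < t)
        score = f1_from_counts(tp, fp, fn)
        if score is not None and (best_f1 is None or score > best_f1):
            best_t, best_f1 = t, score
    return best_t, best_f1


def median_f1(per_concept_f1):
    """Median F1 over the concepts whose F1 is calculable."""
    calculable = [s for s in per_concept_f1.values() if s is not None]
    return median(calculable) if calculable else None


# Toy data for two made-up concepts, invented for illustration only.
data = {
    "concept_a": ({"p1": 3, "p2": 1, "p3": 0, "p4": 2},
                  {"p1": True, "p2": False, "p3": True, "p4": True}),
    "concept_b": ({"p1": 1, "p2": 0},
                  {"p1": True, "p2": False}),
}
chosen = {name: best_threshold(c, y) for name, (c, y) in data.items()}
print(chosen)
print(median_f1({name: f1 for name, (_, f1) in chosen.items()}))
```

In practice it would be safer to select each threshold on a development cohort and then apply it unchanged to the unseen test cohort, since a threshold tuned and scored on the same patients overstates performance.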
Conclusions
NLP can mine unstructured cancer data following limited training. Certain concepts, such as histopathology terms, remained poorly retrieved. Model performance was maintained when applied to a test cohort, demonstrating good generalisability. A concept-specific thresholding strategy improved performance. Fine-tuning annotations were incorporated into the NLP parent model to benefit future performance. CogStack has been applied to extract data for 50 concepts with validated performance for our entire retrospective HNC cohort.
About the journal:
Clinical Oncology is an international cancer journal covering all aspects of the clinical management of cancer patients, reflecting a multidisciplinary approach to therapy. Papers, editorials and reviews are published on all types of malignant disease, embracing pathology, diagnosis and treatment, including radiotherapy, chemotherapy, surgery, combined modality treatment and palliative care. Research and review papers covering epidemiology, radiobiology, radiation physics, tumour biology, and immunology are also published, together with letters to the editor, case reports and book reviews.