Cancer Diagnosis Categorization in Electronic Health Records Using Large Language Models and BioBERT: Model Performance Evaluation Study

Soheil Hashtarkhani, Rezaur Rashid, Christopher L Brett, Lokesh Chinthala, Fekede Asefa Kumsa, Janet A Zink, Robert L Davis, David L Schwartz, Arash Shaban-Nejad

JMIR Cancer, vol. 11, e72005. Published 2025-10-02. DOI: 10.2196/72005. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12490771/pdf/
Abstract
Background: Electronic health records contain inconsistently structured or free-text data, requiring efficient preprocessing to enable predictive health care models. Although artificial intelligence-driven natural language processing tools show promise for automating diagnosis classification, their comparative performance and clinical reliability require systematic evaluation.
Objective: The aim of this study is to evaluate the performance of 4 large language models (GPT-3.5, GPT-4o, Llama 3.2, and Gemini 1.5) and BioBERT in classifying cancer diagnoses from structured and unstructured electronic health records data.
Methods: We analyzed 762 unique diagnoses (326 International Classification of Diseases [ICD] code descriptions, 436 free-text entries) from 3456 records of patients with cancer. Models were tested on their ability to categorize diagnoses into 14 predefined categories. Two oncology experts validated the classifications.
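To illustrate what this categorization step looks like in practice, the sketch below loads a public BioBERT checkpoint with a 14-way classification head. This is a minimal sketch, not the authors' pipeline: the checkpoint name (dmis-lab/biobert-v1.1), the 128-token limit, and the example diagnosis string are assumptions, and the classification head would need fine-tuning on expert-labeled diagnoses before its predictions are meaningful.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

NUM_CATEGORIES = 14  # the study's 14 predefined diagnosis categories (names not given in the abstract)
MODEL_NAME = "dmis-lab/biobert-v1.1"  # public BioBERT checkpoint; assumed, not necessarily the authors' exact weights

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=NUM_CATEGORIES)
model.eval()  # the classification head is randomly initialized here and requires fine-tuning to be useful

def classify_diagnosis(text: str) -> int:
    """Return the predicted category index for one ICD description or free-text diagnosis."""
    inputs = tokenizer(text, truncation=True, max_length=128, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return int(logits.argmax(dim=-1).item())

# Hypothetical input string, for illustration only.
print(classify_diagnosis("Secondary malignant neoplasm of brain"))
```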
Results: BioBERT achieved the highest weighted macro F1-score for ICD codes (84.2) and matched GPT-4o in ICD code accuracy (90.8). For free-text diagnoses, GPT-4o outperformed BioBERT in weighted macro F1-score (71.8 vs 61.5) and achieved slightly higher accuracy (81.9 vs 81.6). GPT-3.5, Gemini, and Llama showed lower overall performance on both formats. Common misclassification patterns included confusion between metastasis and central nervous system tumors, as well as errors involving ambiguous or overlapping clinical terminology.
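For reference, accuracy and weighted F1 of this kind can be computed from predicted versus expert-validated category labels with scikit-learn. The snippet below is a generic sketch: the labels are invented placeholders, and it assumes the reported "weighted macro F1-score" corresponds to scikit-learn's support-weighted F1 average.

```python
from sklearn.metrics import accuracy_score, f1_score

# Placeholder labels for illustration; in the study these would be the
# expert-validated categories (y_true) and one model's predictions (y_pred).
y_true = ["metastasis", "cns_tumor", "breast", "breast", "lung"]
y_pred = ["cns_tumor", "cns_tumor", "breast", "breast", "lung"]

accuracy = accuracy_score(y_true, y_pred)
weighted_f1 = f1_score(y_true, y_pred, average="weighted")  # per-class F1 weighted by class support
print(f"accuracy={accuracy:.1%}  weighted F1={weighted_f1:.3f}")
```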
Conclusions: Although current performance levels appear sufficient for administrative and research use, reliable clinical applications will require standardized documentation practices alongside robust human oversight for high-stakes decision-making.