Data extraction from free-text stroke CT reports using GPT-4o and Llama-3.3-70B: the impact of annotation guidelines
Jonas Wihl, Enrike Rosenkranz, Severin Schramm, Cornelius Berberich, Michael Griessmair, Piotr Woźnicki, Francisco Pinto, Sebastian Ziegelmayer, Lisa C Adams, Keno K Bressem, Jan S Kirschke, Claus Zimmer, Benedikt Wiestler, Dennis Hedderich, Su Hwan Kim
European Radiology Experimental 2025;9(1):61. Published 2025-06-19. DOI: 10.1186/s41747-025-00600-2
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12179022/pdf/
Abstract
Background: To evaluate the impact of an annotation guideline on the performance of large language models (LLMs) in extracting data from stroke computed tomography (CT) reports.
Methods: The performance of GPT-4o and Llama-3.3-70B in extracting ten imaging findings from stroke CT reports was assessed in two datasets from a single academic stroke center. Dataset A (n = 200) was a stratified cohort including various pathological findings, whereas dataset B (n = 100) was a consecutive cohort. Initially, an annotation guideline providing clear data extraction instructions was designed based on a review of cases with inter-annotator disagreements in dataset A. For each LLM, data extraction was performed under two conditions: with the annotation guideline included in the prompt and without it.
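To make the two prompting conditions concrete, here is a minimal sketch of how such structured extraction could be set up with the OpenAI Python SDK. The finding labels, guideline text, and output schema below are illustrative placeholders, not the study's actual prompts or guideline; the Llama-3.3-70B runs would follow the same pattern against whichever endpoint serves that model.

```python
import json
from openai import OpenAI  # assumes the OpenAI Python SDK (>=1.0) is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical subset of the ten imaging findings; the paper's exact labels are not listed here.
FINDINGS = ["infarct_demarcation", "vessel_occlusion", "intracranial_hemorrhage"]

BASE_INSTRUCTIONS = (
    "Extract the following findings from the stroke CT report. "
    "Return a JSON object mapping each finding to true or false.\n"
    f"Findings: {', '.join(FINDINGS)}"
)

# Placeholder text; the study's annotation guideline is not reproduced here.
ANNOTATION_GUIDELINE = "Mark a finding as present only if it is explicitly described as acute."

def extract_findings(report_text: str, use_guideline: bool) -> dict:
    """Query GPT-4o once per report, with or without the guideline appended to the prompt."""
    system_prompt = BASE_INSTRUCTIONS
    if use_guideline:
        system_prompt += "\n\nAnnotation guideline:\n" + ANNOTATION_GUIDELINE

    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,  # keep output as deterministic as possible for reproducible extraction
        response_format={"type": "json_object"},  # constrain the reply to valid JSON
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": report_text},
        ],
    )
    return json.loads(response.choices[0].message.content)
```

Running the same set of reports through this function twice, once with use_guideline=False and once with use_guideline=True, reproduces the two conditions compared in the study design.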
Results: GPT-4o consistently demonstrated superior performance over Llama-3.3-70B under identical conditions, with micro-averaged precision ranging from 0.83 to 0.95 for GPT-4o and from 0.65 to 0.86 for Llama-3.3-70B. Across both models and both datasets, incorporating the annotation guideline into the LLM input resulted in higher precision, while recall largely remained stable. In dataset B, the precision of GPT-4o and Llama-3.3-70B improved from 0.83 to 0.95 and from 0.87 to 0.94, respectively. Overall classification performance with and without the annotation guideline was significantly different in five out of six conditions.
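For reference, micro-averaged precision and recall pool true-positive, false-positive, and false-negative counts across all findings and reports before forming a single ratio, so frequent findings weigh more heavily than rare ones. A small illustrative helper (not taken from the paper) could look like this:

```python
from typing import Iterable

def micro_precision_recall(
    true_labels: Iterable[dict], pred_labels: Iterable[dict]
) -> tuple[float, float]:
    """Compute micro-averaged precision and recall over per-report dicts of
    finding -> bool, pooling counts across all findings and all reports."""
    tp = fp = fn = 0
    for truth, pred in zip(true_labels, pred_labels):
        for finding, is_present in truth.items():
            predicted = pred.get(finding, False)
            tp += int(is_present and predicted)
            fp += int((not is_present) and predicted)
            fn += int(is_present and (not predicted))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall
```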
Conclusion: GPT-4o and Llama-3.3-70B show promising performance in extracting imaging findings from stroke CT reports, with GPT-4o consistently outperforming Llama-3.3-70B. We also provide evidence that well-defined annotation guidelines can enhance LLM data extraction accuracy.
Relevance statement: Annotation guidelines can improve the accuracy of LLMs in extracting findings from radiological reports, potentially optimizing data extraction for specific downstream applications.
Key points: LLMs have utility in data extraction from radiology reports, but the role of annotation guidelines remains underexplored. Data extraction accuracy from stroke CT reports by GPT-4o and Llama-3.3-70B improved when well-defined annotation guidelines were incorporated into the model prompt. Well-defined annotation guidelines can improve the accuracy of LLMs in extracting imaging findings from radiological reports.