{"title":"使用大语言模型的肾囊肿波斯尼亚分类:比较研究。","authors":"Ibrahim Hacibey, Esat Kaba","doi":"10.1007/s00117-025-01499-x","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>The Bosniak classification system is widely used to assess malignancy risk in renal cystic lesions, yet inter-observer variability poses significant challenges. Large language models (LLMs) may offer a standardized approach to classification when provided with textual descriptions, such as those found in radiology reports.</p><p><strong>Objective: </strong>This study evaluated the performance of five LLMs-GPT‑4 (ChatGPT), Gemini, Copilot, Perplexity, and NotebookLM-in classifying renal cysts based on synthetic textual descriptions mimicking CT report content.</p><p><strong>Methods: </strong>A synthetic dataset of 100 diagnostic scenarios (20 cases per Bosniak category) was constructed using established radiological criteria. Each LLM was evaluated using zero-shot and few-shot prompting strategies, while NotebookLM employed retrieval-augmented generation (RAG). Performance metrics included accuracy, sensitivity, and specificity. Statistical significance was assessed using McNemar's and chi-squared tests.</p><p><strong>Results: </strong>GPT‑4 achieved the highest accuracy (87% zero-shot, 99% few-shot), followed by Copilot (81-86%), Gemini (55-69%), and Perplexity (43-69%). NotebookLM, tested only under RAG conditions, reached 87% accuracy. Few-shot learning significantly improved performance (p < 0.05). Classification of Bosniak IIF lesions remained challenging across models.</p><p><strong>Conclusion: </strong>When provided with well-structured textual descriptions, LLMs can accurately classify renal cysts. Few-shot prompting significantly enhances performance. However, persistent difficulties in classifying borderline lesions such as Bosniak IIF highlight the need for further refinement and real-world validation.</p>","PeriodicalId":74635,"journal":{"name":"Radiologie (Heidelberg, Germany)","volume":" ","pages":""},"PeriodicalIF":0.6000,"publicationDate":"2025-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Bosniak classification of renal cysts using large language models: a comparative study.\",\"authors\":\"Ibrahim Hacibey, Esat Kaba\",\"doi\":\"10.1007/s00117-025-01499-x\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Background: </strong>The Bosniak classification system is widely used to assess malignancy risk in renal cystic lesions, yet inter-observer variability poses significant challenges. Large language models (LLMs) may offer a standardized approach to classification when provided with textual descriptions, such as those found in radiology reports.</p><p><strong>Objective: </strong>This study evaluated the performance of five LLMs-GPT‑4 (ChatGPT), Gemini, Copilot, Perplexity, and NotebookLM-in classifying renal cysts based on synthetic textual descriptions mimicking CT report content.</p><p><strong>Methods: </strong>A synthetic dataset of 100 diagnostic scenarios (20 cases per Bosniak category) was constructed using established radiological criteria. Each LLM was evaluated using zero-shot and few-shot prompting strategies, while NotebookLM employed retrieval-augmented generation (RAG). Performance metrics included accuracy, sensitivity, and specificity. 
Statistical significance was assessed using McNemar's and chi-squared tests.</p><p><strong>Results: </strong>GPT‑4 achieved the highest accuracy (87% zero-shot, 99% few-shot), followed by Copilot (81-86%), Gemini (55-69%), and Perplexity (43-69%). NotebookLM, tested only under RAG conditions, reached 87% accuracy. Few-shot learning significantly improved performance (p < 0.05). Classification of Bosniak IIF lesions remained challenging across models.</p><p><strong>Conclusion: </strong>When provided with well-structured textual descriptions, LLMs can accurately classify renal cysts. Few-shot prompting significantly enhances performance. However, persistent difficulties in classifying borderline lesions such as Bosniak IIF highlight the need for further refinement and real-world validation.</p>\",\"PeriodicalId\":74635,\"journal\":{\"name\":\"Radiologie (Heidelberg, Germany)\",\"volume\":\" \",\"pages\":\"\"},\"PeriodicalIF\":0.6000,\"publicationDate\":\"2025-08-24\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Radiologie (Heidelberg, Germany)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1007/s00117-025-01499-x\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Radiologie (Heidelberg, Germany)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1007/s00117-025-01499-x","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Bosniak classification of renal cysts using large language models: a comparative study.
Background: The Bosniak classification system is widely used to assess malignancy risk in renal cystic lesions, yet inter-observer variability poses significant challenges. Large language models (LLMs) may offer a standardized approach to classification when provided with textual descriptions, such as those found in radiology reports.
Objective: This study evaluated the performance of five LLMs, namely GPT‑4 (ChatGPT), Gemini, Copilot, Perplexity, and NotebookLM, in classifying renal cysts based on synthetic textual descriptions mimicking CT report content.
Methods: A synthetic dataset of 100 diagnostic scenarios (20 cases per Bosniak category) was constructed using established radiological criteria. Each LLM was evaluated using zero-shot and few-shot prompting strategies, while NotebookLM employed retrieval-augmented generation (RAG). Performance metrics included accuracy, sensitivity, and specificity. Statistical significance was assessed using McNemar's and chi-squared tests.
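The abstract names the building blocks of the evaluation: zero-shot and few-shot prompting, per-category performance metrics, and a paired McNemar test comparing the two prompting strategies. Below is a minimal sketch of how such an evaluation could be assembled. The prompt wording, the query_llm() placeholder, and the case format are assumptions for illustration and are not the authors' pipeline; only the one-vs-rest sensitivity/specificity definitions and the exact McNemar test follow their standard textbook forms.

```python
# Sketch of an LLM-based Bosniak classification benchmark (illustrative only;
# prompt text and query_llm() are hypothetical, not taken from the study).
from math import comb
from typing import List, Tuple

CATEGORIES = ["I", "II", "IIF", "III", "IV"]  # 20 synthetic cases per category -> 100 total

ZERO_SHOT_TEMPLATE = (
    "You are a radiologist. Assign a Bosniak category (I, II, IIF, III, or IV) "
    "to the renal cystic lesion described below. Answer with the category only.\n\n"
    "Findings: {findings}"
)

# Few-shot prompting prepends a handful of worked examples (hypothetical wording).
FEW_SHOT_PREFIX = (
    "Example: 'Well-defined homogeneous water-attenuation lesion, hairline-thin wall, "
    "no septa, no calcification, no enhancement.' -> Bosniak I\n"
    # ... one or more worked examples per category ...
)

def query_llm(prompt: str) -> str:
    """Placeholder for the actual model call (ChatGPT, Gemini, Copilot, Perplexity, ...)."""
    raise NotImplementedError

def accuracy(pred: List[str], truth: List[str]) -> float:
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)

def sensitivity_specificity(pred: List[str], truth: List[str], category: str) -> Tuple[float, float]:
    """One-vs-rest sensitivity and specificity for a single Bosniak category."""
    tp = sum(p == category and t == category for p, t in zip(pred, truth))
    fn = sum(p != category and t == category for p, t in zip(pred, truth))
    tn = sum(p != category and t != category for p, t in zip(pred, truth))
    fp = sum(p == category and t != category for p, t in zip(pred, truth))
    return tp / (tp + fn), tn / (tn + fp)

def mcnemar_exact(pred_a: List[str], pred_b: List[str], truth: List[str]) -> float:
    """Exact McNemar test on paired correct/incorrect outcomes of two prompting strategies."""
    b = sum(a == t and bb != t for a, bb, t in zip(pred_a, pred_b, truth))  # A right, B wrong
    c = sum(a != t and bb == t for a, bb, t in zip(pred_a, pred_b, truth))  # A wrong, B right
    n = b + c
    if n == 0:
        return 1.0
    # Two-sided exact binomial p-value with p = 0.5 on the discordant pairs
    tail = sum(comb(n, k) for k in range(0, min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

Pairing the zero-shot and few-shot predictions case by case is what makes McNemar's test appropriate here: it asks whether few-shot prompting changes the pattern of errors on the same 100 scenarios, rather than simply comparing two headline accuracy figures.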
Results: GPT‑4 achieved the highest accuracy (87% zero-shot, 99% few-shot), followed by Copilot (81-86%), Gemini (55-69%), and Perplexity (43-69%). NotebookLM, tested only under RAG conditions, reached 87% accuracy. Few-shot learning significantly improved performance (p < 0.05). Classification of Bosniak IIF lesions remained challenging across models.
Conclusion: When provided with well-structured textual descriptions, LLMs can accurately classify renal cysts. Few-shot prompting significantly enhances performance. However, persistent difficulties in classifying borderline lesions such as Bosniak IIF highlight the need for further refinement and real-world validation.