The In-depth Comparative Analysis of Four Large Language AI Models for Risk Assessment and Information Retrieval from Multi-Modality Prostate Cancer Work-up Reports.

Impact factor 4.0; Zone 3 (Medicine); JCR Q1 (Andrology)
Lun-Hsiang Yuan, Shi-Wei Huang, Dean Chou, Chung-You Tsai
World Journal of Mens Health. Journal article, published 2024-12-02. DOI: 10.5534/wjmh.240173 (https://doi.org/10.5534/wjmh.240173)

Abstract

Purpose: Information retrieval (IR) and risk assessment (RA) from multi-modality imaging and pathology reports are critical to prostate cancer (PC) treatment. This study aims to evaluate the performance of four general-purpose large language models (LLMs) in IR and RA tasks.

Materials and methods: We conducted a study using simulated text reports from computed tomography, magnetic resonance imaging, bone scans, and biopsy pathology for stage IV PC patients. We assessed four LLMs (ChatGPT-4-turbo, Claude-3-opus, Gemini-Pro-1.0, ChatGPT-3.5-turbo) on three RA tasks (LATITUDE, CHAARTED, TwNHI) and seven IR tasks. The IR tasks included TNM staging and the detection and quantification of bone and visceral metastases, providing a broad evaluation of the models' capabilities in handling diverse clinical data. We queried the LLMs with multi-modality reports using zero-shot chain-of-thought prompting via an application programming interface (API). With the consensus of three adjudicators as the gold standard, the models' performance was assessed through repeated single-round queries and ensemble voting, using six outcome metrics.
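The ensemble voting step mentioned above can be illustrated with a minimal Python sketch. It assumes a simple majority vote over three repeated answers per patient; the abstract does not specify the voting rule, the tie-breaking rule, or the data layout, so the function names, the "fall back to the first answer" tie rule, and the toy labels below are all hypothetical and are not the study's data or implementation.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer among repeated queries.

    Assumption: on a tie, fall back to the earliest answer given.
    """
    counts = Counter(answers)
    top = max(counts.values())
    for answer in answers:
        if counts[answer] == top:
            return answer


def accuracy(predictions, gold):
    """Fraction of predictions that match the gold-standard labels."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


# Toy data (hypothetical): three query rounds over four patients.
rounds = [
    ["high", "low", "high", "low"],
    ["high", "high", "high", "low"],
    ["low", "low", "high", "high"],
]
gold = ["high", "low", "high", "low"]

# Regroup the answers per patient, then vote across the three rounds.
voted = [majority_vote(list(per_patient)) for per_patient in zip(*rounds)]

single_round_accuracy = [accuracy(r, gold) for r in rounds]
ensemble_accuracy = accuracy(voted, gold)
```

In this toy case the voted answers match the gold standard even though two of the three individual rounds contain errors, which is the kind of effect the comparison between ensemble voting and single queries is meant to measure.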

Results: Among 350 stage IV PC patients with simulated reports, 115 (32.9%), 128 (36.6%), and 94 (26.9%) were classified as high-risk under the LATITUDE, CHAARTED, and TwNHI criteria, respectively. Ensemble voting, based on three repeated single-round queries, consistently enhanced accuracy and was more likely to achieve non-inferior results than a single query. The four models showed minimal differences in IR tasks, with high accuracy (87.4%-94.2%) and consistency (ICC>0.8) in TNM staging. However, RA performance differed significantly, with the models ranking as follows: ChatGPT-4-turbo, Claude-3-opus, Gemini-Pro-1.0, and ChatGPT-3.5-turbo. ChatGPT-4-turbo achieved the highest accuracy (90.1%, 90.7%, 91.6%) and consistency (ICC 0.86, 0.93, 0.76) across the three RA tasks.
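As a quick arithmetic check, the high-risk proportions reported above follow directly from the counts and the cohort size of 350. The short Python sketch below only reproduces that calculation; the variable names are illustrative.

```python
# Counts taken from the Results paragraph; total cohort size is 350.
total_patients = 350
high_risk_counts = {"LATITUDE": 115, "CHAARTED": 128, "TwNHI": 94}

# Percentage of the cohort classified as high-risk, to one decimal place.
high_risk_percent = {
    name: round(100 * count / total_patients, 1)
    for name, count in high_risk_counts.items()
}
```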

Conclusions: ChatGPT-4-turbo demonstrated satisfactory accuracy and outcomes in RA and IR for stage IV PC, suggesting its potential for clinical decision support. However, the risks of misinterpretation impacting decision-making cannot be overlooked. Further research is necessary to validate these findings in other cancers.
