Exploring GPT-4o's multimodal reasoning capabilities with panoramic radiograph: the role of prompt engineering.

IF 3.1 · Zone 2 (Medicine) · JCR Q1 · DENTISTRY, ORAL SURGERY & MEDICINE

Clinical Oral Investigations Pub Date : 2025-08-12 DOI:10.1007/s00784-025-06498-9

Yu-Tao Xiong, Wen-Jun Lian, Ya-Nan Sun, Wei Liu, Ji-Xiang Guo, Wei Tang, Chang Liu

Clinical Oral Investigations 29(9): 405. Publication type: Journal Article.


Abstract

Objectives: This study aimed to evaluate GPT-4o's multimodal reasoning ability to review panoramic radiographs (PRs) and verify their radiologic findings, and to explore the role of prompt engineering in enhancing its performance.

Methods: The study included 230 PRs from West China Hospital of Stomatology in 2024, which were interpreted to generate the PR findings. A total of 300 interpretation errors were manually inserted into the PR findings. An ablation study was conducted to assess whether GPT-4o can reason about PRs under a zero-shot prompt. Prompt engineering was then employed to enhance GPT-4o's ability to identify the interpretation errors using the PRs. The prompt strategies included chain-of-thought, self-consistency, in-context learning, multimodal in-context learning, and their systematic integration into a meta-prompt. Recall, accuracy, and F1 score were used to evaluate the outputs. Subsequently, the localization capability of GPT-4o and its influence on reasoning capability were evaluated.
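Of the prompt strategies listed in the Methods, self-consistency is the easiest to illustrate in code: the same prompt is sampled several times and the most frequent answer is kept. The snippet below is a minimal sketch of that voting step only. The function name, the text normalization, and the sample answers are hypothetical and are not taken from the study, which does not describe its implementation in the abstract.

```python
from collections import Counter


def self_consistent_answer(sampled_answers):
    """Return the most frequent answer and its share of the votes.

    `sampled_answers` is a list of strings, one per independent model run.
    Answers are compared after trimming whitespace and lowercasing, which is
    an illustrative normalization, not the study's procedure.
    """
    if not sampled_answers:
        raise ValueError("need at least one sampled answer")
    normalized = [answer.strip().lower() for answer in sampled_answers]
    winner, votes = Counter(normalized).most_common(1)[0]
    return winner, votes / len(normalized)


# Hypothetical outputs from five runs of the same error-checking prompt.
samples = [
    "Error in finding 3",
    "error in finding 3",
    "No error",
    "Error in finding 3 ",
    "Error in finding 2",
]
answer, share = self_consistent_answer(samples)
print(answer, share)
```

In practice the vote share can double as a rough confidence signal: a low share suggests the sampled reasoning paths disagree and the case may deserve human review.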

Results: In the ablation study, GPT-4o's recall increased significantly from 2.67% to 43.33% when the PRs were provided (P < 0.001). Compared with the zero-shot prompt and the other prompt strategies, GPT-4o with the meta-prompt showed improvements in recall (43.33% vs. 52.67%, P = 0.022), accuracy (39.95% vs. 68.75%, P < 0.001), and F1 score (0.42 vs. 0.60, P < 0.001). The localization accuracy of GPT-4o was 45.67% (137 out of 300, 95% CI: 40.00 to 51.34). A significant correlation was observed between its localization accuracy and its reasoning capability under the meta-prompt (φ coefficient = 0.33, P < 0.001). The model's recall increased by 5.49% (P = 0.031) when accurate localization cues were provided within the meta-prompt.
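The reported figures can be sanity-checked with a few lines of arithmetic. The sketch below rests on two assumptions that the abstract does not confirm: first, that the reported "accuracy" plays the role of precision in the F1 score, and second, that a textbook normal-approximation (Wald) interval is a reasonable stand-in for the authors' confidence-interval method. The function names are illustrative.

```python
import math


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def wald_interval(successes, n, z=1.96):
    """Normal-approximation 95% interval for a proportion (an assumed method)."""
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width


# Assumption: the abstract's "accuracy" is treated here as precision.
zero_shot_f1 = f1_score(precision=0.3995, recall=0.4333)
meta_prompt_f1 = f1_score(precision=0.6875, recall=0.5267)
print(round(zero_shot_f1, 2), round(meta_prompt_f1, 2))  # 0.42 0.6

# Localization accuracy: 137 correct out of 300.
low, high = wald_interval(137, 300)
print(f"{137 / 300:.2%} (95% CI roughly {low:.2%} to {high:.2%})")
```

Under the first assumption the computed F1 values round to the reported 0.42 and 0.60. The Wald interval lands close to, but not exactly on, the reported 40.00 to 51.34, so the authors probably used a slightly different interval method.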

Conclusions: GPT-4o demonstrated some multimodal capability for PR interpretation, and prompt engineering improved its performance. Nevertheless, its performance remains inadequate for clinical requirements. Further work is needed to identify additional factors influencing the model's reasoning capability or to develop more advanced models.

Clinical relevance: This study evaluates GPT-4o's capability to interpret and reason about PRs and explores potential methods to enhance its performance before it is applied clinically to assist radiological assessments.


Source journal

Clinical Oral Investigations (Medicine: Dentistry and Oral Surgery)

CiteScore: 6.30

Self-citation rate: 5.90%

Articles published: 484

Review time: 3 months

About the journal: Clinical Oral Investigations is a multidisciplinary, international forum for the publication of research from all fields of oral medicine. The journal publishes original scientific articles and invited reviews that provide up-to-date results of basic and clinical studies in oral and maxillofacial science and medicine. Its aim is to clarify the relevance of new results to modern practice for an international readership. Coverage includes maxillofacial and oral surgery, prosthetics and restorative dentistry, operative dentistry, endodontics, periodontology, orthodontics, dental materials science, clinical trials, epidemiology, pedodontics, oral implantology, preventive dentistry, oral pathology, oral basic sciences, and more.