Large Language Models' Clinical Decision-Making on When to Perform a Kidney Biopsy: Comparative Study.

IF 6, Zone 2 (Medicine), Q1 HEALTH CARE SCIENCES & SERVICES

Journal of Medical Internet Research Pub Date : 2025-09-18 DOI:10.2196/73603

Michael Toal, Christopher Hill, Michael Quinn, Ciaran O'Neill, Alexander P Maxwell

Journal of Medical Internet Research, volume 27, e73603. Publication type: Journal Article. PMC full text: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12445783/pdf/

Citations: 0

Abstract



Large Language Models' Clinical Decision-Making on When to Perform a Kidney Biopsy: Comparative Study.

Background: Artificial intelligence (AI) and large language models (LLMs) are increasing in sophistication and are being integrated into many disciplines. The potential for LLMs to augment clinical decision-making is an evolving area of research.

Objective: This study compared the responses of over 1000 kidney specialist physicians (nephrologists) with the outputs of commonly used LLMs using a questionnaire determining when a kidney biopsy should be performed.

Methods: This research group completed a large online questionnaire for nephrologists to determine when a kidney biopsy should be performed. The questionnaire was co-designed with patient input, refined through multiple iterations, and piloted locally before international dissemination. It was the largest international study in the field and demonstrated variation among human clinicians in biopsy propensity relating to human factors such as sex and age, as well as systemic factors such as country, job seniority, and technical proficiency. The same questions were put to both human doctors and LLMs in an identical order in a single session. Eight commonly used LLMs were interrogated: ChatGPT-3.5, Mistral Hugging Face, Perplexity, Microsoft Copilot, Llama 2, GPT-4, MedLM, and Claude 3. The most common response given by clinicians (human mode) for each question was taken as the baseline for comparison. Questionnaire responses on the indications and contraindications for biopsy generated a score (0-44) reflecting biopsy propensity, in which a higher score was used as a surrogate marker for an increased tolerance of potential associated risks.

Results: The ability of LLMs to reproduce human expert consensus varied widely with some models demonstrating a balanced approach to risk in a similar manner to humans, while other models reported outputs at either end of the spectrum for risk tolerance. In terms of agreement with the human mode, ChatGPT-3.5 and GPT-4 (OpenAI) had the highest levels of alignment, agreeing with the human mode on 6 out of 11 questions. The total biopsy propensity score generated from the human mode was 23 out of 44. Both OpenAI models produced similar propensity scores between 22 and 24. However, Llama 2 and MS Copilot also scored within this range but with poorer response alignment to the human consensus at only 2 out of 11 questions. The most risk-averse model in this study was MedLM, with a propensity score of 11, and the least risk-averse model was Claude 3, with a score of 34.

Conclusions: The outputs of LLMs demonstrated a modest ability to replicate human clinical decision-making in this study; however, performance varied widely between LLM models. Questions with more uniform human responses produced LLM outputs with higher alignment, whereas questions with lower human consensus showed poorer output alignment. This may limit the practical use of LLMs in real-world clinical practice.
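The comparison described in the Methods (take the most common clinician answer per question as the "human mode", then count how many questions each model answers the same way) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `human_mode` and `agreement_count` are hypothetical, the answers are invented toy data rather than study results, and the 0-44 propensity scoring is not reproduced because the abstract does not specify it.

```python
from collections import Counter


def human_mode(responses_per_question):
    """Return the most common clinician answer for each question.

    Assumption: ties are broken by whichever answer appears first,
    which is how Counter.most_common orders equal counts.
    """
    return [Counter(answers).most_common(1)[0][0]
            for answers in responses_per_question]


def agreement_count(model_answers, mode_answers):
    """Count the questions where the model's answer equals the human mode."""
    return sum(m == h for m, h in zip(model_answers, mode_answers))


# Toy data, invented for illustration only (3 questions, 3 clinicians).
clinician_responses = [
    ["yes", "yes", "no"],
    ["no", "no", "yes"],
    ["yes", "no", "no"],
]
model_answers = ["yes", "yes", "no"]

baseline = human_mode(clinician_responses)
matches = agreement_count(model_answers, baseline)
print(f"{matches} of {len(baseline)} questions match the human mode")
```

In the study the same count would run over 11 questions per model; the reported best alignment was 6 out of 11.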


Source journal

Journal of Medical Internet Research (Medicine: Health Care)

CiteScore: 14.40

Self-citation rate: 5.40%

Articles published: 654

Review time: 1 month

About the journal: The Journal of Medical Internet Research (JMIR) is a highly respected publication in the field of health informatics and health services. Founded in 1999, JMIR has been a pioneer in the field for over two decades. The journal focuses on digital health, data science, health informatics, and emerging technologies for health, medicine, and biomedical research. It is recognized as a top publication in these disciplines, ranking in the first quartile (Q1) by Impact Factor, and is ranked #1 on Google Scholar within the "Medical Informatics" discipline.