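Large language models versus thrombosis experts: A comparative study on patient education and clinical decision-making in venous thromboembolism.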

Impact factor: 5 · Zone 2 (Medicine) · JCR Q1 (Hematology)

Journal of Thrombosis and Haemostasis · Pub Date: 2025-09-25 · DOI: 10.1016/j.jtha.2025.09.004

Nikola Vladic, Stephan Nopp, Ingrid Pabinger, Walter Ageno, Jean M Connors, Sabine Eichinger, Cihan Ay


Citations: 0

Abstract


Large language models versus thrombosis experts: A comparative study on patient education and clinical decision-making in venous thromboembolism.

Background: Large language models (LLMs) have demonstrated remarkable capabilities in various medical fields, yet their performance in thrombosis and haemostasis, particularly in patient education and complex clinical decision-making, is unexplored.

Objectives: We aimed to compare the quality of responses from LLMs versus thrombosis experts and assess clinicians' ability to distinguish between them.

Methods: Three experts on thrombosis and haemostasis and three LLMs (Le Chat Pixtral Large, DeepSeek-R1, ChatGPT-4.5) answered three patient education and three clinical decision-making queries. Thirty-seven physicians rated responses for adequacy (1 = "very poor", 10 = "excellent") and estimated their origin (1 = "certainly LLM," 10 = "certainly human"). Mean differences were assessed via t-tests, medians via Wilcoxon tests, and correlations via Spearman's test. All p values were Bonferroni-adjusted.
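The Bonferroni adjustment and mean-difference reporting named in the Methods can be illustrated with a minimal sketch. It is not the authors' analysis code: the ratings below are made up, the pairing of LLM and expert scores is an assumption (the abstract does not state the test variant), and the confidence interval uses a normal approximation instead of the t distribution to stay dependency-free. The function names `bonferroni_adjust` and `mean_difference_ci` are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def bonferroni_adjust(p_values):
    """Multiply each raw p value by the number of tests, capping at 1.0."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def mean_difference_ci(a, b, level=0.95):
    """Mean of paired differences (a - b) with a normal-approximation CI.

    A t-based interval would be somewhat wider for small samples; the
    normal quantile is used here only to avoid third-party dependencies.
    """
    diffs = [x - y for x, y in zip(a, b)]
    d = mean(diffs)
    se = stdev(diffs) / sqrt(len(diffs))
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return d, d - z * se, d + z * se


# Made-up 1-10 adequacy ratings for illustration, NOT data from the study.
llm_scores = [8, 9, 7, 8, 9, 8]
expert_scores = [6, 7, 7, 6, 8, 6]

diff, ci_low, ci_high = mean_difference_ci(llm_scores, expert_scores)
adjusted = bonferroni_adjust([0.004, 0.03, 0.2])  # hypothetical raw p values

print(f"mean difference {diff:+.1f} (95% CI: {ci_low:.1f} to {ci_high:.1f})")
print("Bonferroni-adjusted p values:", [round(p, 3) for p in adjusted])
```

Because Bonferroni multiplies each raw p value by the number of comparisons, a result that looks significant on its own (such as the hypothetical 0.03 above) can lose significance after adjustment.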

Results: LLMs provided significantly better patient education responses than experts. Mean adequacy score differences were: Le Chat Pixtral Large +1.6 (95% CI: 1.3-2.0, p<0.01), DeepSeek-R1 +1.7 (95% CI: 1.3-2.1, p<0.001), and ChatGPT-4.5 +1.9 (95% CI: 1.6-2.3, p<0.001). In clinical decision-making, DeepSeek-R1 outperformed experts (+1.4; 95% CI: 1.1-1.8, p<0.001), whereas Le Chat Pixtral Large (-0.3; 95% CI: -0.8 to 0.1, p=0.96) and ChatGPT-4.5 (+0.5; 95% CI: 0.0-0.9, p=0.18) performed comparably to experts. Evaluators could not distinguish between expert (median: 6.0, interquartile range [IQR]: 3.0-8.0) and LLM-generated responses (median: 6.0, IQR: 4.0-8.0).

Conclusion: LLMs outperform experts in VTE-related patient education and match or exceed them in clinical decision-making, providing responses indistinguishable from experts. Though major barriers need to be addressed, LLMs have strong potential to support clinical management and patient education in VTE.


Source journal

Journal of Thrombosis and Haemostasis (Medicine: Peripheral Vascular Disease)

CiteScore: 24.30

Self-citation rate: 3.80%

Articles published: 321

Review time: 1 month

About the journal: The Journal of Thrombosis and Haemostasis (JTH) serves as the official journal of the International Society on Thrombosis and Haemostasis. It is dedicated to advancing science related to thrombosis, bleeding disorders, and vascular biology through the dissemination and exchange of information and ideas within the global research community. Types of publications: original research reports, state-of-the-art reviews, brief reports, case reports, invited commentaries on publications in the Journal, forum articles, correspondence, and announcements. Scope of contributions: editors invite contributions from both fundamental and clinical domains. These include basic manuscripts on blood coagulation and fibrinolysis; studies on proteins and reactions related to thrombosis and haemostasis; research on blood platelets and their interactions with other biological systems, such as the vessel wall, blood cells, and invading organisms; and clinical manuscripts covering various topics including venous thrombosis, arterial disease, hemophilia, bleeding disorders, and platelet diseases. Clinical manuscripts may encompass etiology, diagnostics, prognosis, prevention, and treatment strategies.