Andrea De Vito, Nicholas Geremia, Davide Fiore Bavaro, Susan K Seo, Justin Laracy, Maria Mazzitelli, Andrea Marino, Alberto Enrico Maraolo, Antonio Russo, Agnese Colpani, Michele Bartoletti, Anna Maria Cattelan, Cristina Mussini, Saverio Giuseppe Parisi, Luigi Angelo Vaira, Giuseppe Nunnari, Giordano Madeddu
Journal: Clinical Microbiology and Infection (impact factor 10.9; JCR Q1, Infectious Diseases)
Published: 2025-03-19
DOI: 10.1016/j.cmi.2025.03.002 (https://doi.org/10.1016/j.cmi.2025.03.002)
Comparing large language models for antibiotic prescribing in different clinical scenarios: which performs better?
Objectives: Large language models (LLMs) show promise in clinical decision-making, but comparative evaluations of their antibiotic prescribing accuracy are limited. This study assesses the performance of various LLMs in recommending antibiotic treatments across diverse clinical scenarios.
Methods: Fourteen LLMs, including standard and premium versions of ChatGPT, Claude, Copilot, Gemini, Le Chat, Grok, Perplexity, and Pi.ai, were evaluated using 60 clinical cases with antibiograms covering 10 infection types. A standardized prompt was used for antibiotic recommendations focusing on drug choice, dosage, and treatment duration. Responses were anonymized and reviewed by a blinded expert panel assessing antibiotic appropriateness, dosage correctness, and duration adequacy.
Results: A total of 840 responses were collected and analysed. ChatGPT-o1 demonstrated the highest accuracy in antibiotic prescriptions, with 71.7% (43/60) of its recommendations classified as correct and only one (1.7%) incorrect. Gemini and Claude 3 Opus had the lowest accuracy. Dosage correctness was highest for ChatGPT-o1 (96.7%, 58/60), followed by Claude 3.5 Sonnet (91.7%, 55/60) and Perplexity Pro (90.0%, 54/60). In treatment duration, Gemini provided the most appropriate recommendations (75.0%, 45/60), whereas Claude 3.5 Sonnet tended to over-prescribe duration. Performance declined with increasing case complexity, particularly for difficult-to-treat microorganisms.
Discussion: There is significant variability among LLMs in prescribing appropriate antibiotics, dosages, and treatment durations. ChatGPT-o1 outperformed other models, indicating the potential of advanced LLMs as decision-support tools in antibiotic prescribing. However, decreased accuracy in complex cases and inconsistencies among models highlight the need for careful validation before clinical utilization.
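The counts quoted in the Results can be cross-checked with simple arithmetic: 14 models evaluated on 60 cases should yield 840 responses, and each percentage should match its stated fraction. The sketch below is only an illustrative check, not part of the study's analysis; the names `REPORTED`, `percentage`, and `check_reported` are hypothetical, and the figures are copied from the abstract.

```python
# Illustrative consistency check of the figures quoted in the abstract.
# All names here are hypothetical; the numbers are taken from the Results text.

N_MODELS = 14  # "Fourteen LLMs"
N_CASES = 60   # "60 clinical cases"

# (label, numerator, denominator, percentage as reported)
REPORTED = [
    ("ChatGPT-o1, antibiotic choice correct", 43, 60, 71.7),
    ("ChatGPT-o1, antibiotic choice incorrect", 1, 60, 1.7),
    ("ChatGPT-o1, dosage correct", 58, 60, 96.7),
    ("Claude 3.5 Sonnet, dosage correct", 55, 60, 91.7),
    ("Perplexity Pro, dosage correct", 54, 60, 90.0),
    ("Gemini, duration appropriate", 45, 60, 75.0),
]


def percentage(numerator: int, denominator: int) -> float:
    """Return numerator/denominator as a percentage rounded to one decimal."""
    return round(100 * numerator / denominator, 1)


def check_reported(rows):
    """Return the rows whose recomputed percentage differs from the reported one."""
    mismatches = []
    for label, numerator, denominator, reported in rows:
        recomputed = percentage(numerator, denominator)
        if abs(recomputed - reported) > 1e-9:
            mismatches.append((label, reported, recomputed))
    return mismatches


total_responses = N_MODELS * N_CASES
mismatches = check_reported(REPORTED)

if __name__ == "__main__":
    print("total responses:", total_responses)
    print("mismatched percentages:", mismatches)
```

An empty mismatch list means each reported percentage agrees with its fraction to one decimal place; it says nothing about how the expert panel classified the responses.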
About the journal:
Clinical Microbiology and Infection (CMI) is a monthly journal published by the European Society of Clinical Microbiology and Infectious Diseases. It focuses on peer-reviewed papers covering basic and applied research in microbiology, infectious diseases, virology, parasitology, immunology, and epidemiology as they relate to therapy and diagnostics.