Christoph R Buhr, Harry Smith, Tilman Huppertz, Katharina Bahr-Hamm, Christoph Matthias, Clemens Cuny, Jan Phillipp Snijders, Benjamin Philipp Ernst, Andrew Blaikie, Tom Kelsey, Sebastian Kuhn, Jonas Eckrich
{"title":"评估耳鼻喉科领域不同大型语言模型的未知潜在质量和局限性。","authors":"Christoph R Buhr, Harry Smith, Tilman Huppertz, Katharina Bahr-Hamm, Christoph Matthias, Clemens Cuny, Jan Phillipp Snijders, Benjamin Philipp Ernst, Andrew Blaikie, Tom Kelsey, Sebastian Kuhn, Jonas Eckrich","doi":"10.1080/00016489.2024.2352843","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Large Language Models (LLMs) might offer a solution for the lack of trained health personnel, particularly in low- and middle-income countries. However, their strengths and weaknesses remain unclear.</p><p><strong>Aims/objectives: </strong>Here we benchmark different LLMs (Bard 2023.07.13, Claude 2, ChatGPT 4) against six consultants in otorhinolaryngology (ORL).</p><p><strong>Material and methods: </strong>Case-based questions were extracted from literature and German state examinations. Answers from Bard 2023.07.13, Claude 2, ChatGPT 4, and six ORL consultants were rated blindly on a 6-point Likert-scale for medical adequacy, comprehensibility, coherence, and conciseness. Given answers were compared to validated answers and evaluated for hazards. A modified Turing test was performed and character counts were compared.</p><p><strong>Results: </strong>LLMs answers ranked inferior to consultants in all categories. Yet, the difference between consultants and LLMs was marginal, with the clearest disparity in conciseness and the smallest in comprehensibility. Among LLMs Claude 2 was rated best in medical adequacy and conciseness. Consultants' answers matched the validated solution in 93% (228/246), ChatGPT 4 in 85% (35/41), Claude 2 in 78% (32/41), and Bard 2023.07.13 in 59% (24/41). Answers were rated as potentially hazardous in 10% (24/246) for ChatGPT 4, 14% (34/246) for Claude 2, 19% (46/264) for Bard 2023.07.13, and 6% (71/1230) for consultants.</p><p><strong>Conclusions and significance: </strong>Despite consultants superior performance, LLMs show potential for clinical application in ORL. Future studies should assess their performance on larger scale.</p>","PeriodicalId":6880,"journal":{"name":"Acta Oto-Laryngologica","volume":" ","pages":"237-242"},"PeriodicalIF":1.2000,"publicationDate":"2024-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Assessing unknown potential-quality and limitations of different large language models in the field of otorhinolaryngology.\",\"authors\":\"Christoph R Buhr, Harry Smith, Tilman Huppertz, Katharina Bahr-Hamm, Christoph Matthias, Clemens Cuny, Jan Phillipp Snijders, Benjamin Philipp Ernst, Andrew Blaikie, Tom Kelsey, Sebastian Kuhn, Jonas Eckrich\",\"doi\":\"10.1080/00016489.2024.2352843\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Background: </strong>Large Language Models (LLMs) might offer a solution for the lack of trained health personnel, particularly in low- and middle-income countries. However, their strengths and weaknesses remain unclear.</p><p><strong>Aims/objectives: </strong>Here we benchmark different LLMs (Bard 2023.07.13, Claude 2, ChatGPT 4) against six consultants in otorhinolaryngology (ORL).</p><p><strong>Material and methods: </strong>Case-based questions were extracted from literature and German state examinations. Answers from Bard 2023.07.13, Claude 2, ChatGPT 4, and six ORL consultants were rated blindly on a 6-point Likert-scale for medical adequacy, comprehensibility, coherence, and conciseness. 
Given answers were compared to validated answers and evaluated for hazards. A modified Turing test was performed and character counts were compared.</p><p><strong>Results: </strong>LLMs answers ranked inferior to consultants in all categories. Yet, the difference between consultants and LLMs was marginal, with the clearest disparity in conciseness and the smallest in comprehensibility. Among LLMs Claude 2 was rated best in medical adequacy and conciseness. Consultants' answers matched the validated solution in 93% (228/246), ChatGPT 4 in 85% (35/41), Claude 2 in 78% (32/41), and Bard 2023.07.13 in 59% (24/41). Answers were rated as potentially hazardous in 10% (24/246) for ChatGPT 4, 14% (34/246) for Claude 2, 19% (46/264) for Bard 2023.07.13, and 6% (71/1230) for consultants.</p><p><strong>Conclusions and significance: </strong>Despite consultants superior performance, LLMs show potential for clinical application in ORL. Future studies should assess their performance on larger scale.</p>\",\"PeriodicalId\":6880,\"journal\":{\"name\":\"Acta Oto-Laryngologica\",\"volume\":\" \",\"pages\":\"237-242\"},\"PeriodicalIF\":1.2000,\"publicationDate\":\"2024-03-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Acta Oto-Laryngologica\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.1080/00016489.2024.2352843\",\"RegionNum\":4,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"2024/5/23 0:00:00\",\"PubModel\":\"Epub\",\"JCR\":\"Q3\",\"JCRName\":\"OTORHINOLARYNGOLOGY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Acta Oto-Laryngologica","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1080/00016489.2024.2352843","RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2024/5/23 0:00:00","PubModel":"Epub","JCR":"Q3","JCRName":"OTORHINOLARYNGOLOGY","Score":null,"Total":0}
Assessing unknown potential: quality and limitations of different large language models in the field of otorhinolaryngology.
Background: Large Language Models (LLMs) might offer a solution to the shortage of trained health personnel, particularly in low- and middle-income countries. However, their strengths and weaknesses remain unclear.
Aims/objectives: Here we benchmark three LLMs (Bard 2023.07.13, Claude 2, and ChatGPT 4) against six consultants in otorhinolaryngology (ORL).
Material and methods: Case-based questions were extracted from the literature and from German state examinations. Answers from Bard 2023.07.13, Claude 2, ChatGPT 4, and six ORL consultants were rated blindly on a 6-point Likert scale for medical adequacy, comprehensibility, coherence, and conciseness. The given answers were compared with validated answers and evaluated for potential hazards. A modified Turing test was performed, and character counts were compared.
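To make the rating scheme concrete, the following minimal sketch shows one way such blinded Likert ratings could be aggregated per answer source. The data layout, names, and scores are illustrative assumptions, not the study's actual materials or code.

```python
from statistics import mean

# Hypothetical blinded ratings: each answer is scored 1-6 (Likert) on the
# four criteria used in the study. All values below are invented for illustration.
ratings = {
    "Claude 2": [
        {"adequacy": 5, "comprehensibility": 6, "coherence": 5, "conciseness": 4},
        {"adequacy": 4, "comprehensibility": 5, "coherence": 5, "conciseness": 5},
    ],
    "Consultant A": [
        {"adequacy": 6, "comprehensibility": 5, "coherence": 6, "conciseness": 6},
    ],
}

# Average each criterion per answer source; raters would see only the answer
# text, not its source, so the grouping by source happens after rating.
for source, scored_answers in ratings.items():
    for criterion in ("adequacy", "comprehensibility", "coherence", "conciseness"):
        avg = mean(answer[criterion] for answer in scored_answers)
        print(f"{source}: mean {criterion} = {avg:.2f}")
```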
Results: The LLMs' answers were rated inferior to the consultants' in all categories. Yet the difference between consultants and LLMs was marginal, with the clearest disparity in conciseness and the smallest in comprehensibility. Among the LLMs, Claude 2 was rated best for medical adequacy and conciseness. Consultants' answers matched the validated solution in 93% of cases (228/246), ChatGPT 4 in 85% (35/41), Claude 2 in 78% (32/41), and Bard 2023.07.13 in 59% (24/41). Answers were rated as potentially hazardous in 10% (24/246) for ChatGPT 4, 14% (34/246) for Claude 2, 19% (46/246) for Bard 2023.07.13, and 6% (71/1230) for consultants.
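The denominators follow one plausible reading of the design: six consultants each answering the 41 questions gives 246 consultant answers, while each LLM contributed 41; the abstract does not spell this out. The sketch below simply recomputes the quoted percentages from the raw counts (an illustration, not the authors' analysis code).

```python
# Match and hazard counts as quoted in the abstract; the rates printed
# below reproduce the rounded percentages reported there.
counts = {
    "Consultants":     {"match": (228, 246), "hazard": (71, 1230)},
    "ChatGPT 4":       {"match": (35, 41),   "hazard": (24, 246)},
    "Claude 2":        {"match": (32, 41),   "hazard": (34, 246)},
    "Bard 2023.07.13": {"match": (24, 41),   "hazard": (46, 246)},
}

for source, rates in counts.items():
    for kind, (k, n) in rates.items():
        print(f"{source}: {kind} = {k}/{n} = {100 * k / n:.0f}%")
```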
Conclusions and significance: Despite the consultants' superior performance, LLMs show potential for clinical application in ORL. Future studies should assess their performance on a larger scale.
Journal description:
Acta Oto-Laryngologica is a truly international journal for translational otolaryngology and head and neck surgery. The journal presents cutting-edge papers on clinical practice, clinical research, and basic science, bridging the gap between clinical and basic research.