Assessing unknown potential-quality and limitations of different large language models in the field of otorhinolaryngology.

IF: 1 | Zone 4, Medicine | JCR Q3, Otorhinolaryngology

Acta Oto-Laryngologica | Pub Date: 2024-03-01 | Epub Date: 2024-05-23 | DOI: 10.1080/00016489.2024.2352843

Christoph R Buhr, Harry Smith, Tilman Huppertz, Katharina Bahr-Hamm, Christoph Matthias, Clemens Cuny, Jan Phillipp Snijders, Benjamin Philipp Ernst, Andrew Blaikie, Tom Kelsey, Sebastian Kuhn, Jonas Eckrich

Pages: 237-242 | Publication type: Journal Article

Citations: 0

Abstract

Assessing unknown potential-quality and limitations of different large language models in the field of otorhinolaryngology.

Background: Large Language Models (LLMs) might offer a solution for the lack of trained health personnel, particularly in low- and middle-income countries. However, their strengths and weaknesses remain unclear.

Aims/objectives: Here we benchmark different LLMs (Bard 2023.07.13, Claude 2, ChatGPT 4) against six consultants in otorhinolaryngology (ORL).

Material and methods: Case-based questions were extracted from the literature and German state examinations. Answers from Bard 2023.07.13, Claude 2, ChatGPT 4, and six ORL consultants were rated blindly on a 6-point Likert scale for medical adequacy, comprehensibility, coherence, and conciseness. The answers given were compared to validated answers and evaluated for hazards. A modified Turing test was performed and character counts were compared.

Results: LLMs' answers ranked inferior to consultants' in all categories. Yet the difference between consultants and LLMs was marginal, with the clearest disparity in conciseness and the smallest in comprehensibility. Among the LLMs, Claude 2 was rated best in medical adequacy and conciseness. Consultants' answers matched the validated solution in 93% (228/246) of cases, ChatGPT 4 in 85% (35/41), Claude 2 in 78% (32/41), and Bard 2023.07.13 in 59% (24/41). Answers were rated as potentially hazardous in 10% (24/246) for ChatGPT 4, 14% (34/246) for Claude 2, 19% (46/264) for Bard 2023.07.13, and 6% (71/1230) for consultants.

Conclusions and significance: Despite consultants' superior performance, LLMs show potential for clinical application in ORL. Future studies should assess their performance on a larger scale.
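
The percentages in the Results can be rechecked from the counts given alongside them. The sketch below is a reader's sanity check, not part of the study: the labels, the `REPORTED` table, and the `check` helper are illustrative names, and the only inputs are the numerators, denominators, and rounded percentages quoted in the abstract. Under these assumptions, every figure reproduces except the Bard 2023.07.13 hazard rate, where 46/264 comes to about 17.4% rather than the stated 19%. A denominator of 246, as used for the other two LLMs, would give 19%, but the abstract does not confirm which number is the misprint.

```python
# Counts and rounded percentages exactly as quoted in the abstract.
# Each row: (label, numerator, denominator, reported percent).
REPORTED = [
    ("Consultants, matched validated solution", 228, 246, 93),
    ("ChatGPT 4, matched validated solution", 35, 41, 85),
    ("Claude 2, matched validated solution", 32, 41, 78),
    ("Bard 2023.07.13, matched validated solution", 24, 41, 59),
    ("ChatGPT 4, potentially hazardous", 24, 246, 10),
    ("Claude 2, potentially hazardous", 34, 246, 14),
    ("Bard 2023.07.13, potentially hazardous", 46, 264, 19),
    ("Consultants, potentially hazardous", 71, 1230, 6),
]


def check(rows):
    """Return (label, recomputed %, reported %) for rows that do not agree.

    A row agrees when the recomputed percentage, rounded to the nearest
    integer, equals the percentage reported in the abstract.
    """
    mismatches = []
    for label, numerator, denominator, reported in rows:
        pct = 100 * numerator / denominator
        if round(pct) != reported:
            mismatches.append((label, round(pct, 1), reported))
    return mismatches


if __name__ == "__main__":
    for label, pct, reported in check(REPORTED):
        print(f"{label}: recomputed {pct}%, reported {reported}%")
```

Comparing at integer precision mirrors how the abstract rounds its figures, so the check only flags a row when the count and the percentage cannot both be right.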

Source journal

Acta Oto-Laryngologica (Medicine: Otorhinolaryngology)

CiteScore: 2.50

Self-citation rate: 0.00%

Articles published: (not listed)

Review time: 3-6 weeks

About the journal: Acta Oto-Laryngologica is a truly international journal for translational otolaryngology and head and neck surgery. The journal presents cutting-edge papers on clinical practice, clinical research, and basic sciences. Acta also bridges the gap between clinical and basic research.