Comparative Analysis of the Response Accuracies of Large Language Models in the Korean National Dental Hygienist Examination Across Korean and English Questions.
Authors: Eun Sun Song, Seung-Pyo Lee
Journal: International Journal of Dental Hygiene (journal article, published 2024-10-16)
DOI: 10.1111/idh.12848 (https://doi.org/10.1111/idh.12848)
Abstract
Introduction: Large language models such as Gemini, GPT-3.5, and GPT-4 have demonstrated significant potential in the medical field. Their performance in medical licensing examinations globally has highlighted their capabilities in understanding and processing specialized medical knowledge. This study aimed to evaluate and compare the performance of Gemini, GPT-3.5, and GPT-4 in the Korean National Dental Hygienist Examination. The accuracy of answering the examination questions in both Korean and English was assessed.
Methods: This study used a dataset comprising questions from the Korean National Dental Hygienist Examination over 5 years (2019-2023). A two-way analysis of variance (ANOVA) test was employed to investigate the impacts of model type and language on the accuracy of the responses. Questions were input into each model under standardized conditions, and responses were classified as correct or incorrect based on predefined criteria.
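The analysis described in the Methods can be illustrated with a minimal sketch, assuming a balanced design in which each model-by-language cell holds one accuracy value per examination year. The abstract does not say how the authors ran their ANOVA, so this is not their procedure. The function names `accuracy` and `two_way_anova`, the cell layout, and all numbers in the demo are hypothetical placeholders, not results from the study.

```python
from itertools import product
from statistics import mean


def accuracy(responses, answer_key):
    """Fraction of responses that match the answer key."""
    if not answer_key or len(responses) != len(answer_key):
        raise ValueError("responses and answer_key must be non-empty and equal in length")
    return sum(r == k for r, k in zip(responses, answer_key)) / len(answer_key)


def two_way_anova(data):
    """Balanced two-way ANOVA with replication.

    data maps (level_a, level_b) -> list of observations. Every combination
    of levels must be present with the same number of observations (n >= 2).
    Returns sums of squares, degrees of freedom, and F statistics.
    """
    levels_a = sorted({a for a, _ in data})
    levels_b = sorted({b for _, b in data})
    cells = list(product(levels_a, levels_b))
    if set(cells) != set(data):
        raise ValueError("every (a, b) combination needs a cell")
    n = len(data[cells[0]])
    if n < 2 or any(len(data[c]) != n for c in cells):
        raise ValueError("design must be balanced with n >= 2 per cell")

    grand = mean(x for c in cells for x in data[c])
    mean_a = {a: mean(x for b in levels_b for x in data[(a, b)]) for a in levels_a}
    mean_b = {b: mean(x for a in levels_a for x in data[(a, b)]) for b in levels_b}
    cell_mean = {c: mean(data[c]) for c in cells}

    # Partition the total variation into main effects, interaction, and error.
    ss_a = n * len(levels_b) * sum((mean_a[a] - grand) ** 2 for a in levels_a)
    ss_b = n * len(levels_a) * sum((mean_b[b] - grand) ** 2 for b in levels_b)
    ss_ab = n * sum(
        (cell_mean[(a, b)] - mean_a[a] - mean_b[b] + grand) ** 2 for a, b in cells
    )
    ss_err = sum((x - cell_mean[c]) ** 2 for c in cells for x in data[c])

    df_a, df_b = len(levels_a) - 1, len(levels_b) - 1
    df_ab = df_a * df_b
    df_err = len(cells) * (n - 1)
    ms_err = ss_err / df_err
    if ms_err == 0:
        raise ValueError("zero error variance; F statistics are undefined")

    def row(ss, df):
        return {"ss": ss, "df": df, "F": (ss / df) / ms_err}

    return {
        "factor_a": row(ss_a, df_a),
        "factor_b": row(ss_b, df_b),
        "interaction": row(ss_ab, df_ab),
        "error": {"ss": ss_err, "df": df_err},
    }


if __name__ == "__main__":
    # Hypothetical placeholder accuracies (one value per year, 2019-2023).
    # These are NOT the study's results.
    placeholder = {
        ("Gemini", "Korean"): [0.60, 0.62, 0.58, 0.61, 0.59],
        ("Gemini", "English"): [0.63, 0.65, 0.61, 0.64, 0.62],
        ("GPT-3.5", "Korean"): [0.50, 0.52, 0.49, 0.51, 0.53],
        ("GPT-3.5", "English"): [0.58, 0.60, 0.57, 0.59, 0.61],
        ("GPT-4", "Korean"): [0.75, 0.77, 0.74, 0.76, 0.78],
        ("GPT-4", "English"): [0.82, 0.84, 0.81, 0.83, 0.85],
    }
    for name, stats in two_way_anova(placeholder).items():
        print(name, stats)
```

Here `factor_a` corresponds to model type and `factor_b` to language. The sketch stops at the F statistics because the Python standard library has no F distribution. If SciPy is available, `scipy.stats.f.sf(F, df_effect, df_error)` would give the corresponding p-value.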
Results: GPT-4 consistently outperformed the other models, achieving the highest accuracy in both language versions in every examination year. In particular, it showed superior performance in English, suggesting advancements in its training algorithms for language processing. However, the accuracy of all models varied in subjects with localized characteristics, such as health and medical law.
Conclusions: These findings indicate that GPT-4 holds significant promise for application in medical education and standardized testing, especially in English. However, the variability in performance across different subjects and languages underscores the need for ongoing improvements and the inclusion of more diverse and localized training datasets to enhance the models' effectiveness in multilingual and multicultural contexts.
About the journal:
International Journal of Dental Hygiene is the official scientific peer-reviewed journal of the International Federation of Dental Hygienists (IFDH). The journal brings the latest scientific news, high quality commissioned reviews as well as clinical, professional and educational developmental and legislative news to the profession world-wide. Thus, it acts as a forum for exchange of relevant information and enhancement of the profession with the purpose of promoting oral health for patients and communities.
The aim of the International Journal of Dental Hygiene is to provide a forum for exchange of scientific knowledge in the field of oral health and dental hygiene. A further aim is to support and facilitate the application of new knowledge into clinical practice. The journal welcomes original research, reviews and case reports as well as clinical, professional, educational and legislative news to the profession world-wide.