Performance of Large Language Models in the Non-English Context: Qualitative Study of Models Trained on Different Languages in Chinese Medical Examinations.

IF 3.1 3区医学 Q2 MEDICAL INFORMATICS

JMIR Medical Informatics Pub Date : 2025-06-27 DOI:10.2196/69485

Zhong Yao, Liantan Duan, Shuo Xu, Lingyi Chi, Dongfang Sheng

{"title":"Performance of Large Language Models in the Non-English Context: Qualitative Study of Models Trained on Different Languages in Chinese Medical Examinations.","authors":"Zhong Yao, Liantan Duan, Shuo Xu, Lingyi Chi, Dongfang Sheng","doi":"10.2196/69485","DOIUrl":null,"url":null,"abstract":"Background: Research on large language models (LLMs) in the medical field has predominantly focused on models trained with English-language corpora, evaluating their performance within English-speaking contexts. The performances of models trained with non-English language corpora and their performance in non-English contexts remain underexplored.Objective: This study aimed to evaluate the performances of LLMs trained on different languages corpora by using the Chinese National Medical Licensing Examination (CNMLE) as a benchmark and constructed analogous questions.Methods: Under different prompt settings, we sequentially posed questions to 7 LLMs: 2 primarily trained on English-language corpora and 5 primarily on Chinese-language corpora. The models' responses were compared against standard answers to calculate the accuracy rate of each model. Further subgroup analyses were conducted by categorizing the questions based on various criteria. We also collected error sets to explore patterns of mistakes across different models.Results: Under the zero-shot setting, 6 out of 7 models exceeded the passing level, with the highest accuracy rate achieved by the Chinese LLM Baichuan (86.67%), followed by ChatGPT (83.83%). In the constructed questions, all 7 models exceeded the passing threshold, with Baichuan maintaining the highest accuracy rate (87.00%). In few-shot learning, all models exceeded the passing threshold. Baichuan, ChatGLM, and ChatGPT retained the highest accuracy. While Llama showed marked improvement over previous tests, the relative performance rankings of other models stayed similar to previous results. In subgroup analyses, English models demonstrated comparable or superior performance to Chinese models on questions related to ethics and policy. All models except Llama generally had higher accuracy rates for simple questions than for complex ones. The error set of ChatGPT was similar to those of other Chinese models. Multimodel cross-verification outperformed single model, particularly improving accuracy rate on simple questions. The implementation of dual-model and tri-model verification achieved accuracy rates of 94.17% and 96.33% respectively.Conclusions: At the current level, LLMs trained primarily on English corpora and those trained mainly on Chinese corpora perform similarly well in CNMLE, with Chinese models still outperforming. The performance difference between ChatGPT and other Chinese LLMs are not solely due to communication barriers but are more likely influenced by disparities in the training data. By using a method of cross-verification with multiple LLMs, excellent performance can be achieved in medical examinations.","PeriodicalId":56334,"journal":{"name":"JMIR Medical Informatics","volume":"13 ","pages":"e69485"},"PeriodicalIF":3.1000,"publicationDate":"2025-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12227152/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"JMIR Medical Informatics","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.2196/69485","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"MEDICAL INFORMATICS","Score":null,"Total":0}

引用次数: 0

Abstract

Background: Research on large language models (LLMs) in the medical field has predominantly focused on models trained with English-language corpora, evaluating their performance within English-speaking contexts. The performances of models trained with non-English language corpora and their performance in non-English contexts remain underexplored.

Objective: This study aimed to evaluate the performances of LLMs trained on different languages corpora by using the Chinese National Medical Licensing Examination (CNMLE) as a benchmark and constructed analogous questions.

Methods: Under different prompt settings, we sequentially posed questions to 7 LLMs: 2 primarily trained on English-language corpora and 5 primarily on Chinese-language corpora. The models' responses were compared against standard answers to calculate the accuracy rate of each model. Further subgroup analyses were conducted by categorizing the questions based on various criteria. We also collected error sets to explore patterns of mistakes across different models.

Results: Under the zero-shot setting, 6 out of 7 models exceeded the passing level, with the highest accuracy rate achieved by the Chinese LLM Baichuan (86.67%), followed by ChatGPT (83.83%). In the constructed questions, all 7 models exceeded the passing threshold, with Baichuan maintaining the highest accuracy rate (87.00%). In few-shot learning, all models exceeded the passing threshold. Baichuan, ChatGLM, and ChatGPT retained the highest accuracy. While Llama showed marked improvement over previous tests, the relative performance rankings of other models stayed similar to previous results. In subgroup analyses, English models demonstrated comparable or superior performance to Chinese models on questions related to ethics and policy. All models except Llama generally had higher accuracy rates for simple questions than for complex ones. The error set of ChatGPT was similar to those of other Chinese models. Multimodel cross-verification outperformed single model, particularly improving accuracy rate on simple questions. The implementation of dual-model and tri-model verification achieved accuracy rates of 94.17% and 96.33% respectively.

Conclusions: At the current level, LLMs trained primarily on English corpora and those trained mainly on Chinese corpora perform similarly well in CNMLE, with Chinese models still outperforming. The performance difference between ChatGPT and other Chinese LLMs are not solely due to communication barriers but are more likely influenced by disparities in the training data. By using a method of cross-verification with multiple LLMs, excellent performance can be achieved in medical examinations.

查看原文本刊更多论文

大型语言模型在非英语语境下的表现：中国医学考试中不同语言训练模型的定性研究。

背景：医学领域对大型语言模型（llm）的研究主要集中在用英语语料库训练的模型上，评估它们在英语环境中的表现。用非英语语料库训练的模型的性能及其在非英语语境中的表现仍未得到充分的研究。目的：以中国医师执业资格考试（CNMLE）为基准，构建类比题，评价在不同语言语料库上训练的法学硕士的表现。方法：在不同的提示设置下，我们依次向7位法学硕士提出问题，其中2位主要接受英语语料库训练，5位主要接受汉语语料库训练。将模型的回答与标准答案进行比较，以计算每个模型的准确率。进一步的亚组分析是根据不同的标准对问题进行分类。我们还收集了错误集，以探索不同模型中的错误模式。结果：在零弹设置下，7个模型中有6个超过了及格水平，其中中国法学硕士白川的准确率最高（86.67%），其次是ChatGPT（83.83%）。在构建的问题中，7个模型均超过了通过阈值，其中白川模型的准确率最高（87.00%）。在几次学习中，所有模型都超过了及格阈值。白川、ChatGLM和ChatGPT的准确率最高。虽然Llama在之前的测试中表现出明显的进步，但其他车型的相对性能排名与之前的结果相似。在亚组分析中，英语模型在与伦理和政策相关的问题上表现出与中国模型相当或更好的表现。除Llama外，所有模型在简单问题上的准确率都高于复杂问题。ChatGPT的误差集与其他中文模型相似。多模型交叉验证优于单一模型，特别是在简单问题上提高准确率。双模型验证和三模型验证的准确率分别为94.17%和96.33%。结论：在目前的水平上，主要以英语语料库训练的法学硕士和主要以中文语料库训练的法学硕士在CNMLE中的表现相似，中文模型的表现仍然优于中文模型。ChatGPT与其他中国法学硕士之间的性能差异不仅仅是由于沟通障碍，更有可能是受训练数据差异的影响。通过使用多个llm交叉验证的方法，可以在医学检查中取得优异的性能。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

JMIR Medical Informatics Medicine-Health Informatics

CiteScore

7.90

自引率

3.10%

发文量

173

审稿时长

12 weeks

期刊介绍： JMIR Medical Informatics (JMI, ISSN 2291-9694) is a top-rated, tier A journal which focuses on clinical informatics, big data in health and health care, decision support for health professionals, electronic health records, ehealth infrastructures and implementation. It has a focus on applied, translational research, with a broad readership including clinicians, CIOs, engineers, industry and health informatics professionals. Published by JMIR Publications, publisher of the Journal of Medical Internet Research (JMIR), the leading eHealth/mHealth journal (Impact Factor 2016: 5.175), JMIR Med Inform has a slightly different scope (emphasizing more on applications for clinicians and health professionals rather than consumers/citizens, which is the focus of JMIR), publishes even faster, and also allows papers which are more technical or more formative than what would be published in the Journal of Medical Internet Research.