Comparing Large Language Models and Traditional Machine Translation Tools for Translating Medical Consultation Summaries: Quantitative Pilot Feasibility Study.
Andy Li, Wei Zhou, Rashina Hoda, Chris Bain, Peter Poon
JMIR Formative Research, vol. 10, e85169. DOI: 10.2196/85169. Published 2026-04-13 (Journal Article; JCR Q3, Health Care Sciences & Services; impact factor 2.0).
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC13075536/pdf/
Citations: 0
Abstract
Background: Translation of medical consultation summaries is essential for equitable health care communication in culturally and linguistically diverse populations. While machine translation (MT) tools and large language models (LLMs) are widely accessible, their feasibility and safety for health care contexts remain underexplored.
Objective: This pilot study investigates the feasibility and limitations of using LLMs and traditional MT tools to translate medical consultation summaries from English into the most common languages other than English spoken in Australia: Arabic, Chinese (simplified written form), and Vietnamese.
Methods: Two simulated summaries (a simple patient-facing summary and a complex clinician-oriented interprofessional letter) were translated using 3 LLMs (GPT-4o, Llama-3.1, and Gemma-2) and 3 MT tools (Google Translate, Microsoft Bing Translator, and DeepL). Translations were benchmarked against professional third-party interpreter translations using BLEU (Bilingual Evaluation Understudy), chrF (character-level F-score), and METEOR (Metric for Evaluation of Translation with Explicit Ordering).
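To illustrate what the surface-level metrics named above measure, the sketch below implements a minimal sentence-level BLEU score in pure Python: modified n-gram precisions (up to 4-grams) combined by geometric mean, with a brevity penalty for short hypotheses. This is an illustrative simplification, not the study's evaluation pipeline; published work typically uses a reference implementation such as sacreBLEU, and the exact smoothing and tokenization choices differ.

```python
import math
from collections import Counter


def ngram_counts(tokens: list[str], n: int) -> Counter:
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(reference: str, hypothesis: str, max_n: int = 4) -> float:
    """Simplified sentence-level BLEU: geometric mean of clipped
    n-gram precisions times a brevity penalty (no smoothing)."""
    ref, hyp = reference.split(), hypothesis.split()
    if not hyp:
        return 0.0
    precisions = []
    for n in range(1, max_n + 1):
        ref_ng, hyp_ng = ngram_counts(ref, n), ngram_counts(hyp, n)
        overlap = sum((hyp_ng & ref_ng).values())  # clipped matches
        total = max(sum(hyp_ng.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0  # unsmoothed BLEU is zero if any precision is zero
    log_avg = sum(math.log(p) for p in precisions) / max_n
    brevity_penalty = min(1.0, math.exp(1 - len(ref) / len(hyp)))
    return brevity_penalty * math.exp(log_avg)
```

Because BLEU rewards exact surface overlap with the reference, a fluent translation that paraphrases the interpreter's wording can score poorly, which is one reason the study pairs it with chrF and METEOR.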
Results: Translation performance varied across languages, tools, and summary complexity when assessed using automatic evaluation metrics. Traditional MT tools outperformed LLMs on surface-level metrics, while LLMs showed relative strengths in semantic similarity for Vietnamese and Chinese. Arabic translations improved with more complex input, suggesting an advantage in handling its rich morphology. The metric-based evaluation highlighted feasibility but also risks, particularly in Chinese clinical contexts.
Conclusions: This pilot study provides formative evidence of opportunities and limitations in applying artificial intelligence translation for health care communication. Findings underscore the importance of human oversight; domain-specific evaluation metrics; and further formative and clinical research to guide the safe, equitable use of artificial intelligence translation tools.