Comparing Large Language Models and Traditional Machine Translation Tools for Translating Medical Consultation Summaries: Quantitative Pilot Feasibility Study.

Impact Factor 2.0 · Q3 · Health Care Sciences & Services
Andy Li, Wei Zhou, Rashina Hoda, Chris Bain, Peter Poon
DOI: 10.2196/85169
Journal: JMIR Formative Research, volume 10, article e85169
Publication date: 2026-04-13 (Journal Article)
Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC13075536/pdf/
Citations: 0

Abstract

Background: Translation of medical consultation summaries is essential for equitable health care communication in culturally and linguistically diverse populations. While machine translation (MT) tools and large language models (LLMs) are widely accessible, their feasibility and safety for health care contexts remain underexplored.

Objective: This pilot study investigates the feasibility and limitations of using LLMs and traditional MT tools to translate medical consultation summaries from English into the most common languages other than English spoken in Australia: Arabic, Chinese (simplified written form), and Vietnamese.

Methods: Two simulated summaries, a simple patient-facing summary and a complex clinician-oriented interprofessional letter, were translated using 3 LLMs (GPT-4o, Llama-3.1, and Gemma-2) and 3 MT tools (Google Translate, Microsoft Bing Translator, and DeepL). Translations were benchmarked against professional third-party interpreter translations using the Bilingual Evaluation Understudy (BLEU), character-level F-score (chrF), and Metric for Evaluation of Translation with Explicit Ordering (METEOR) metrics.
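The character-level F-score referenced above is well suited to morphologically rich languages because it rewards partial word matches. As an illustration (not the study's own code), a minimal chrF-style score can be sketched in pure Python; the function name and defaults (n-grams up to 6, beta = 2) are assumptions matching the metric's common configuration:

```python
from collections import Counter

def char_ngrams(text, n):
    """Counter of character n-grams, with whitespace removed first."""
    s = "".join(text.split())
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def chrf_score(hypothesis, reference, max_n=6, beta=2.0):
    """chrF-style F-beta score averaged over character n-gram orders 1..max_n."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        hyp_total, ref_total = sum(hyp.values()), sum(ref.values())
        if hyp_total == 0 or ref_total == 0:
            continue  # string too short for this n-gram order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / hyp_total)
        recalls.append(overlap / ref_total)
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # F-beta with beta = 2 weights recall twice as heavily as precision.
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)

# Identical strings score 1.0; unrelated strings score near 0.
print(round(chrf_score("take one tablet daily", "take one tablet daily"), 2))
```

Production evaluations would normally use a reference implementation such as the `sacrebleu` package rather than a hand-rolled scorer, since tokenization and smoothing details affect comparability across studies.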

Results: Translation performance varied across languages, tools, and summary complexity when assessed with automatic evaluation metrics. Traditional MT tools outperformed LLMs on surface-level metrics, while LLMs showed relative strengths in semantic similarity for Vietnamese and Chinese. Arabic translations improved with more complex input, suggesting morphological advantages. The metric-based evaluation highlighted feasibility but also risks, particularly in Chinese clinical contexts.

Conclusions: This pilot study provides formative evidence of opportunities and limitations in applying artificial intelligence translation for health care communication. Findings underscore the importance of human oversight; domain-specific evaluation metrics; and further formative and clinical research to guide the safe, equitable use of artificial intelligence translation tools.

Source journal: JMIR Formative Research (Medicine, miscellaneous)
CiteScore: 2.70 · Self-citation rate: 9.10% · Articles per year: 579 · Review time: 12 weeks