Evaluating a Large Language Model in Translating Patient Instructions to Spanish Using a Standardized Framework
Mondira Ray, Daniel J Kats, Joss Moorkens, Dinesh Rai, Nate Shaar, Diane Quinones, Alejandro Vermeulen, Camila M Mateo, Ryan C L Brewster, Alisa Khan, Benjamin Rader, John S Brownstein, Jonathan D Hron
JAMA Pediatrics. Journal Article. Published 2025-07-07. DOI: 10.1001/jamapediatrics.2025.1729 (https://doi.org/10.1001/jamapediatrics.2025.1729)
Citations: 0
Abstract
Importance
Patients and caregivers who use languages other than English in the US encounter barriers to accessing language-concordant written instructions after clinical visits. Large language models (LLMs), such as OpenAI's GPT-4o, may improve access to translated patient materials; however, rigorous evaluation is needed to ensure clinical standards are met.
Objective
To determine whether GPT-4o can generate high-quality Spanish translations of personalized patient instructions comparable to those performed by professional human translators.
Design, Setting, and Participants
This cross-sectional study compared LLM translations to professional human translations using equivalence testing. The personalized pediatric instructions used were derived from real clinical encounters at a large US academic medical center and translated between January 2023 and December 2023. Patient instructions in English were translated into Spanish by GPT-4o and professional human translators. The source English texts were translated using GPT-4o on August 2, 2024. Both sets of translations were evaluated by 3 independent professional medical translators.
Exposure
Patient instructions were translated using GPT-4o with an engineered prompt, and these translations were compared with those produced by professional human translators.
Main Outcomes and Measures
The primary outcome was translation quality, assessed using the Multidimensional Quality Metrics (MQM) framework to generate an overall MQM score (rated on a 0-100 scale). Secondary outcomes included a general preference rating and error rates for types of translation errors.
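As a rough illustration of how a penalty-based quality score on a 0-100 scale can be computed, the sketch below uses a linear, MQM-style scoring model. The severity weights, the normalization by word count, and the function name `mqm_style_score` are assumptions made for illustration; the abstract does not state the exact scoring parameters the study used.

```python
# Hypothetical severity weights (assumption, not taken from the study).
SEVERITY_WEIGHTS = {"minor": 1, "major": 5, "critical": 25}


def mqm_style_score(error_severities, word_count, max_score=100.0):
    """Return a 0-100 quality score from a list of annotated error severities.

    Each error adds its severity weight to a penalty total, which is
    normalized by the length of the evaluated text and subtracted from
    the maximum score. The result is floored at 0.
    """
    if word_count <= 0:
        raise ValueError("word_count must be positive")
    penalty = sum(SEVERITY_WEIGHTS[severity] for severity in error_severities)
    score = max_score - max_score * penalty / word_count
    return max(0.0, score)


# Made-up example: two minor errors and one major error in a 200-word text.
example_score = mqm_style_score(["minor", "minor", "major"], word_count=200)
print(example_score)
```

Under these assumed weights, a single critical error costs as much as 25 minor ones, which is why a penalty-based score can separate translations that have similar error counts but different clinical risk.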
Results
This study included 20 source files of pediatric patient instructions. Equivalence testing showed no significant difference in translation quality between GPT-4o and human translations, with a mean difference of 1.6 points (90% CI, 0.7-2.5), falling within a predefined equivalence margin of plus or minus 5 MQM points. The LLM yielded fewer mistranslation errors, and a mean (SE) of 52% (6%) of professional translator ratings preferred the LLM translations.
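The equivalence conclusion above rests on a simple interval check: the 90% CI for the mean difference must lie entirely inside the predefined margin. A minimal sketch using only the numbers reported in the Results follows; the function name `ci_within_margin` is illustrative, and the sketch does not reproduce the study's statistical model, which the abstract does not describe.

```python
def ci_within_margin(ci_low, ci_high, margin):
    """Return True when the whole confidence interval lies inside (-margin, +margin)."""
    return -margin < ci_low and ci_high < margin


# Values reported in the abstract.
ci_90 = (0.7, 2.5)   # 90% CI of the mean difference in MQM points
margin = 5.0         # predefined equivalence margin, plus or minus 5 points

equivalent = ci_within_margin(ci_90[0], ci_90[1], margin)
interval_excludes_zero = not (ci_90[0] <= 0.0 <= ci_90[1])

print(equivalent)              # True
print(interval_excludes_zero)  # True
```

The check shows that the reported interval sits well inside the margin, which is what an equivalence claim requires. The interval does not include 0, so the conclusion is that any difference is smaller than the 5-point margin, not that the difference is exactly zero.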
Conclusions and Relevance
In this cross-sectional study, GPT-4o generated Spanish translations of pediatric patient instructions that were comparable in quality to those by professional human translators as evaluated using a standardized framework. While human review of LLM translation remains essential in health care, these findings suggest that GPT-4o could reduce the translation workload for Spanish, potentially freeing resources to support languages of lesser diffusion.
About the journal:
JAMA Pediatrics, published continuously since 1911 and the oldest continuously published pediatric journal in the US, is an international peer-reviewed publication and part of the JAMA Network. It is published weekly online and in 12 issues annually, and it receives over 8.4 million article views and downloads per year. All research articles become freely accessible online 12 months after publication, with no author fees, and the online version is available to institutions in developing countries through the WHO's HINARI program.
With a focus on advancing the health of infants, children, and adolescents, JAMA Pediatrics serves as a platform for discussing crucial issues and policies in child and adolescent health care. Leveraging the latest technology, it ensures timely access to information for its readers worldwide.