Can machine translation match human expertise? Quantifying the performance of large language models in the translation of patient-reported outcome measures (PROMs).
Sheng-Chieh Lu, Cai Xu, Manraj Kaur, Maria Orlando Edelen, Andrea Pusic, Chris Gibbons
{"title":"Can machine translation match human expertise? Quantifying the performance of large language models in the translation of patient-reported outcome measures (PROMs).","authors":"Sheng-Chieh Lu, Cai Xu, Manraj Kaur, Maria Orlando Edelen, Andrea Pusic, Chris Gibbons","doi":"10.1186/s41687-025-00926-w","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>The rise in artificial intelligence tools, especially those competent at language interpretation and translation, enables opportunities to enhance patient-centered care. One might be the ability to rapidly and inexpensively create accurate translations of English language patient-reported outcome measures (PROMs) to facilitate global uptake. Currently, it is unclear if machine translation (MT) tools can produce sufficient translation quality for this purpose.</p><p><strong>Methodology: </strong>We used Generative Pretrained Transformer (GPT)-4, GPT-3.5, and Google Translate to translate the English versions of selected scales from the Breast-Q and Face-Q, two widely used PROMs assessing outcomes following breast and face reconstructive surgery, respectively. We used MT to forward and back translate the scales from English into Arabic, Vietnamese, Italian, Hungarian, Malay, and Dutch. We compared translation quality using the Metrics for Evaluation of Translation with Explicit Ordering (METEOR). We compared the scores between different translation versions using the Kruskal-Wallis test or analysis of variance as appropriate.</p><p><strong>Results: </strong>In forward translations, the METEOR scores significantly varied depending on target languages for all MT tools (p < 0.001), with GPT-4 having the highest scores in most languages. We detected significantly different scores among translators for all languages (p < .05), except for Italian (p = 0.59). In backward translations, MTs (GPT-4: 0.81 ± 0.10; GPT-3.5: 0.78 ± 0.12; Google Translate: 0.80 ± 0.06) received higher or compatible scores to human translations (0.76 ± 0.11) for all languages. The differences in backward translation scores by different forward translators were significant for all languages (p < 0.01; except for Italian, p = 0.2). The scores between different languages were also significantly different for all translators (p < 0.001).</p><p><strong>Conclusions: </strong>Our findings suggest that large language models provide high-quality PROM translations to support human translations to reduce costs. However, substituting human translation with MT is not advisable at the current stage.</p>","PeriodicalId":36660,"journal":{"name":"Journal of Patient-Reported Outcomes","volume":"9 1","pages":"94"},"PeriodicalIF":2.4000,"publicationDate":"2025-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Patient-Reported Outcomes","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1186/s41687-025-00926-w","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"HEALTH CARE SCIENCES & SERVICES","Score":null,"Total":0}
引用次数: 0
Abstract
Background: The rise of artificial intelligence tools, especially those capable of language interpretation and translation, creates opportunities to enhance patient-centered care. One such opportunity is the ability to rapidly and inexpensively create accurate translations of English-language patient-reported outcome measures (PROMs) to facilitate global uptake. Currently, it is unclear whether machine translation (MT) tools can produce translations of sufficient quality for this purpose.
Methodology: We used Generative Pretrained Transformer (GPT)-4, GPT-3.5, and Google Translate to translate the English versions of selected scales from the Breast-Q and Face-Q, two widely used PROMs assessing outcomes following breast and face reconstructive surgery, respectively. We used MT to forward- and back-translate the scales between English and Arabic, Vietnamese, Italian, Hungarian, Malay, and Dutch. We assessed translation quality using the Metric for Evaluation of Translation with Explicit ORdering (METEOR). We compared scores across translation versions using the Kruskal-Wallis test or analysis of variance, as appropriate.
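To make the back-translation scoring step concrete, the sketch below computes a METEOR score between an original English item and its machine back-translation using nltk. The example sentences and variable names are illustrative assumptions, not items or code from the study.

```python
# Minimal METEOR scoring sketch, assuming nltk is installed and the
# 'wordnet' corpus (used by METEOR for synonym matching) is downloaded.
import nltk
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet", quiet=True)

# Hypothetical PROM item and its machine back-translation (not study data).
original = "I am satisfied with how my breasts look in the mirror."
back_translation = "I am happy with how my breasts appear in the mirror."

# Recent nltk versions expect pre-tokenized input; whitespace splitting
# keeps the sketch dependency-free beyond the wordnet corpus.
score = meteor_score(
    [original.split()],        # reference(s): the source English item
    back_translation.split(),  # hypothesis: the back-translated item
)
print(f"METEOR: {score:.2f}")  # 1.0 = exact match; lower = more divergence
```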
Results: In forward translations, METEOR scores varied significantly across target languages for all MT tools (p < 0.001), with GPT-4 achieving the highest scores in most languages. Scores differed significantly among translators for all languages (p < 0.05), except Italian (p = 0.59). In backward translations, MT tools (GPT-4: 0.81 ± 0.10; GPT-3.5: 0.78 ± 0.12; Google Translate: 0.80 ± 0.06) received scores higher than or comparable to human translations (0.76 ± 0.11) for all languages. Backward translation scores differed significantly by forward translator for all languages (p < 0.01), except Italian (p = 0.2). Scores also differed significantly between languages for all translators (p < 0.001).
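As an illustration of the statistical comparison reported above, the following sketch applies scipy's Kruskal-Wallis test to per-item METEOR scores grouped by translator. The score arrays are hypothetical placeholders, not the study's data.

```python
# Hedged sketch of comparing METEOR scores across MT tools for one
# target language with the Kruskal-Wallis H-test (used when normality
# of the score distributions cannot be assumed).
from scipy.stats import kruskal

# Hypothetical per-item METEOR scores for one target language.
gpt4_scores   = [0.85, 0.79, 0.88, 0.74, 0.81]
gpt35_scores  = [0.78, 0.70, 0.82, 0.69, 0.77]
google_scores = [0.80, 0.76, 0.83, 0.75, 0.79]

stat, p = kruskal(gpt4_scores, gpt35_scores, google_scores)
print(f"H = {stat:.2f}, p = {p:.4f}")  # p < 0.05 -> scores differ by translator
```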
Conclusions: Our findings suggest that large language models can produce high-quality PROM translations that could support human translators and reduce costs. However, substituting human translation with MT is not advisable at this stage.