Evaluation of Generative Artificial Intelligence Implementation Impacts in Social and Health Care Language Translation: Mixed Methods Case Study.

Impact Factor 2.0 · Q3 (Health Care Sciences & Services)
Miia Martikainen, Kari Smolander, Johan Sanmark, Enni Sanmark
{"title":"Evaluation of Generative Artificial Intelligence Implementation Impacts in Social and Health Care Language Translation: Mixed Methods Case Study.","authors":"Miia Martikainen, Kari Smolander, Johan Sanmark, Enni Sanmark","doi":"10.2196/73658","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Generative artificial intelligence (GAI) is expected to enhance the productivity of the public social and health care sector while maintaining, at minimum, current standards of quality and user experience. However, empirical evidence on GAI impacts in practical, real-life settings remains limited.</p><p><strong>Objective: </strong>This study investigates productivity, machine translation quality, and user experience impacts of the GPT-4 language model in an in-house language translation services team of a large well-being services county in Finland.</p><p><strong>Methods: </strong>A mixed methods study was conducted with 4 in-house translators between March and June 2024. Quantitative data of 908 translation segments were collected in real-life conditions using the computer-assisted language translation software Trados (RWS) to assess productivity differences between machine and human translation. Quality was measured using 4 automatic metrics (human-targeted translation edit rate, Bilingual Evaluation Understudy, Metric for Evaluation of Translation With Explicit Ordering, and Character n-gram F-score) applied to 1373 GAI-human segment pairs. User experience was investigated through 5 semistructured interviews, including the team supervisor.</p><p><strong>Results: </strong>The findings indicate that, on average, postediting machine translation is 14% faster than translating texts from scratch (2.75 vs 2.40 characters per second, P=.03), and up to 37% faster when the number of segments is equalized across translators. However, productivity varied notably between individuals, with improvements ranging from -2% to 102%. Regarding translation quality, 11% (141/1261) of Finnish-Swedish and 16% (18/112) of Finnish-English GAI outputs were accepted without edits. Average human-targeted translation edit rate scores were 55 (Swedish) and 46 (English), indicating that approximately half of the words required editing. Bilingual Evaluation Understudy scores averaged 43 for Swedish and 38 for English, suggesting good translation quality. Metric for Evaluation of Translation With Explicit Ordering and Character n-gram F-scores reached 63 and 68 for Swedish and 59 and 57 for English, respectively. All metrics have been converted to an equivalent scale from 0 to 100, with 100 reflecting a perfect match. Interviewed translators expressed mixed reviews on productivity gains but generally perceived value in using GAI, especially for repetitive, generic content. Identified challenges included inconsistent or incorrect terminology, lack of document-level context, and limited system customization.</p><p><strong>Conclusions: </strong>Based on this case study, GPT-4-based GAI shows measurable potential to enhance translation productivity and quality within an in-house translation team in the public social and health care sector. However, its effectiveness appears to be influenced by factors, such as translator postediting skills, workflow design, and organizational readiness. 
These findings suggest that, in similar contexts, public social and health care organizations could benefit from investing in translator training, optimizing technical integration, redesigning workflows, and implementing effective change management. Future research should examine larger translator teams to assess the generalizability of these results and further explore how translation quality and user experience can be improved through domain-specific customization.</p>","PeriodicalId":14841,"journal":{"name":"JMIR Formative Research","volume":"9 ","pages":"e73658"},"PeriodicalIF":2.0000,"publicationDate":"2025-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12443352/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"JMIR Formative Research","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.2196/73658","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"HEALTH CARE SCIENCES & SERVICES","Score":null,"Total":0}

Abstract

Background: Generative artificial intelligence (GAI) is expected to enhance the productivity of the public social and health care sector while maintaining, at minimum, current standards of quality and user experience. However, empirical evidence on GAI impacts in practical, real-life settings remains limited.

Objective: This study investigates productivity, machine translation quality, and user experience impacts of the GPT-4 language model in an in-house language translation services team of a large well-being services county in Finland.

Methods: A mixed methods study was conducted with 4 in-house translators between March and June 2024. Quantitative data on 908 translation segments were collected under real-life conditions using the computer-assisted translation software Trados (RWS) to assess productivity differences between machine and human translation. Quality was measured using 4 automatic metrics (human-targeted translation edit rate, Bilingual Evaluation Understudy, Metric for Evaluation of Translation With Explicit Ordering, and Character n-gram F-score) applied to 1373 GAI-human segment pairs. User experience was investigated through semistructured interviews with 5 participants, including the team supervisor. A minimal scoring sketch for three of these metrics is given after the Methods paragraph.
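For readers unfamiliar with these metrics, the sketch below (an illustration, not the study's actual pipeline) shows how BLEU, chrF, and translation edit rate could be computed for GAI-human segment pairs with the open-source sacrebleu library. Scoring TER against the human-postedited segments approximates human-targeted TER (HTER); METEOR is omitted because it requires a separate implementation (eg, nltk with WordNet data). The example segments are hypothetical placeholders.

```python
# Minimal sketch of segment-pair quality scoring with sacrebleu (assumption:
# this mirrors the metric definitions, not the study's actual tooling).
from sacrebleu.metrics import BLEU, CHRF, TER

# Hypothetical placeholder segments: raw GAI output vs human-postedited reference.
gai_outputs = ["Patienten kan boka tid via webbtjänsten."]
postedited_refs = [["Patienten kan boka en tid via webbtjänsten."]]  # one reference stream

metrics = {
    "BLEU": BLEU(),  # Bilingual Evaluation Understudy (higher = better)
    "chrF": CHRF(),  # Character n-gram F-score (higher = better)
    "TER": TER(),    # translation edit rate; against postedited refs it approximates HTER (lower = fewer edits)
}

for name, metric in metrics.items():
    result = metric.corpus_score(gai_outputs, postedited_refs)
    print(f"{name}: {result.score:.1f}")  # sacrebleu reports all three on a 0-100 scale
```

Note that raw TER/HTER counts edits, so lower values mean fewer changes; the study rescales all reported metrics so that 100 reflects a perfect match.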

Results: The findings indicate that, on average, postediting machine translation is 14% faster than translating texts from scratch (2.75 vs 2.40 characters per second, P=.03), and up to 37% faster when the number of segments is equalized across translators; the throughput arithmetic is illustrated after this paragraph. However, productivity varied notably between individuals, with improvements ranging from -2% to 102%. Regarding translation quality, 11% (141/1261) of Finnish-Swedish and 16% (18/112) of Finnish-English GAI outputs were accepted without edits. Average human-targeted translation edit rate scores were 55 (Swedish) and 46 (English), indicating that approximately half of the words required editing. Bilingual Evaluation Understudy scores averaged 43 for Swedish and 38 for English, suggesting good translation quality. Metric for Evaluation of Translation With Explicit Ordering and Character n-gram F-scores reached 63 and 68 for Swedish and 59 and 57 for English, respectively. All metrics were converted to a common scale from 0 to 100, with 100 reflecting a perfect match. Interviewed translators expressed mixed views on productivity gains but generally perceived value in using GAI, especially for repetitive, generic content. Identified challenges included inconsistent or incorrect terminology, lack of document-level context, and limited system customization.
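As an illustration of the throughput arithmetic behind these figures (a hypothetical helper, not code or data from the study), productivity can be expressed as characters produced per second of working time, and the postediting gain as a relative speedup:

```python
# Hypothetical illustration of the productivity comparison reported above.
# The character and time totals are placeholders chosen to reproduce the
# reported averages (2.75 vs 2.40 characters per second).
def chars_per_second(characters: int, seconds: float) -> float:
    """Average translation throughput over a set of segments."""
    return characters / seconds

postediting_cps = chars_per_second(characters=4125, seconds=1500)   # placeholder totals -> 2.75 cps
from_scratch_cps = chars_per_second(characters=3600, seconds=1500)  # placeholder totals -> 2.40 cps

speedup_pct = (postediting_cps / from_scratch_cps - 1) * 100
# ~15% from the rounded averages; the study reports 14%, presumably from unrounded data.
print(f"Relative postediting speedup: {speedup_pct:.0f}%")
```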

Conclusions: Based on this case study, GPT-4-based GAI shows measurable potential to enhance translation productivity and quality within an in-house translation team in the public social and health care sector. However, its effectiveness appears to be influenced by factors such as translator postediting skills, workflow design, and organizational readiness. These findings suggest that, in similar contexts, public social and health care organizations could benefit from investing in translator training, optimizing technical integration, redesigning workflows, and implementing effective change management. Future research should examine larger translator teams to assess the generalizability of these results and further explore how translation quality and user experience can be improved through domain-specific customization.
