Assessing GPT and DeepL for terminology translation in the medical domain: A comparative study on the human phenotype ontology.

IF 3.8 · CAS Region 3 (Medicine) · Q2 MEDICAL INFORMATICS
Richard Noll, Alexandra Berger, Dominik Kieu, Tobias Mueller, Ferdinand O Bohmann, Angelina Müller, Svea Holtz, Philipp Stoffers, Sebastian Hoehl, Oya Guengoeze, Jan-Niklas Eckardt, Holger Storf, Jannik Schaaf
DOI: 10.1186/s12911-025-03075-8
Journal: BMC Medical Informatics and Decision Making, 25(1): 237
Published: 2025-07-01 (Journal Article)
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12220062/pdf/
Citations: 0

Abstract


Background: This paper presents a comparative study of two state-of-the-art language models, OpenAI's GPT and DeepL, in the context of terminology translation within the medical domain.

Methods: This study was conducted on the human phenotype ontology (HPO), which is used in medical research and diagnosis. Medical experts assessed the performance of both models on a set of 120 translated HPO terms and their 180 synonyms, employing a 4-point Likert scale (strongly agree = 1, agree = 2, disagree = 3, strongly disagree = 4). An independent reference translation from the HeTOP database was used to validate the quality of the translation.

Results: The average Likert rating for the selected HPO terms was 1.29 for GPT-3.5 and 1.37 for DeepL. The quality of the translations was also found to be satisfactory for multi-word terms with greater ontological depth. The comparison with HeTOP revealed a high degree of similarity between the models' translations and the reference translations.

Conclusions: Statistical analysis revealed no significant differences in the mean ratings between the two models, indicating their comparable performance in terms of translation quality. The study not only illustrates the potential of machine translation but also shows incomplete coverage of translated medical terminology. This underscores the relevance of this study for cross-lingual medical research. However, the evaluation methods need to be further refined, specific translation issues need to be addressed, and the sample size needs to be increased to allow for more generalizable conclusions.
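As an illustration of the kind of comparison the conclusions describe, the sketch below contrasts two sets of 4-point Likert ratings with a rank-based statistic. The rating values and the choice of a Mann-Whitney U comparison are assumptions for demonstration only; the paper does not publish its per-term rating data or name the test used.

```python
# Illustrative sketch only: ratings and the Mann-Whitney U comparison are
# assumptions, not data or methods taken from the paper.

def mann_whitney_u(a, b):
    """U statistic for sample a: count pairs with a_i < b_j; ties count 0.5."""
    return sum(1.0 if x < y else 0.5 if x == y else 0.0 for x in a for y in b)

# Hypothetical ratings (1 = strongly agree ... 4 = strongly disagree)
gpt_ratings = [1, 1, 2, 1, 1, 2, 1, 1, 1, 2]
deepl_ratings = [1, 2, 1, 1, 2, 1, 1, 2, 1, 2]

mean_gpt = sum(gpt_ratings) / len(gpt_ratings)
mean_deepl = sum(deepl_ratings) / len(deepl_ratings)
print(f"mean GPT = {mean_gpt:.2f}, mean DeepL = {mean_deepl:.2f}")

# Likert data are ordinal, so a rank-based test is a common choice over a t-test.
u = mann_whitney_u(gpt_ratings, deepl_ratings)
print(f"U = {u} (n1 * n2 = {len(gpt_ratings) * len(deepl_ratings)})")
```

A U value near half of n1 * n2 (here, near 50 of 100) indicates heavily overlapping rating distributions, consistent with the paper's finding of no significant difference between the two models.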

Source journal metrics:
CiteScore: 7.20
Self-citation rate: 5.70%
Annual article count: 297
Review time: 1 month
Journal description: BMC Medical Informatics and Decision Making is an open access journal publishing original peer-reviewed research articles in relation to the design, development, implementation, use, and evaluation of health information technologies and decision-making for human health.