Richard Noll, Alexandra Berger, Dominik Kieu, Tobias Mueller, Ferdinand O Bohmann, Angelina Müller, Svea Holtz, Philipp Stoffers, Sebastian Hoehl, Oya Guengoeze, Jan-Niklas Eckardt, Holger Storf, Jannik Schaaf
BMC Medical Informatics and Decision Making, 25(1):237. Published 2025-07-01. DOI: 10.1186/s12911-025-03075-8. Open access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12220062/pdf/
Assessing GPT and DeepL for terminology translation in the medical domain: A comparative study on the human phenotype ontology.
Background: This paper presents a comparative study of two state-of-the-art language models, OpenAI's GPT and DeepL, in the context of terminology translation within the medical domain.
Methods: This study was conducted on the human phenotype ontology (HPO), which is used in medical research and diagnosis. Medical experts assessed the performance of both models on a set of 120 translated HPO terms and their 180 synonyms, employing a 4-point Likert scale (strongly agree = 1, agree = 2, disagree = 3, strongly disagree = 4). An independent reference translation from the HeTOP database was used to validate the quality of the translations.
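The rating scheme described above can be sketched as follows. This is a minimal, hypothetical illustration of aggregating 4-point Likert scores per model; the term names and scores are invented for the example, not the study's actual data.

```python
from statistics import mean

# Illustrative expert ratings per translated HPO term, on the 4-point
# Likert scale described above (1 = strongly agree the translation is
# adequate ... 4 = strongly disagree). All values are hypothetical.
ratings = {
    "gpt": {"Seizure": [1, 1, 2], "Microcephaly": [1, 2, 1]},
    "deepl": {"Seizure": [1, 2, 2], "Microcephaly": [1, 1, 2]},
}

def average_rating(model_ratings):
    """Mean Likert score across all terms and raters for one model."""
    all_scores = [s for scores in model_ratings.values() for s in scores]
    return mean(all_scores)

for model, term_ratings in ratings.items():
    print(f"{model}: {average_rating(term_ratings):.2f}")
```

Lower averages indicate better translations under this scale, which is why the reported means near 1 (1.29 and 1.37) correspond to ratings between "strongly agree" and "agree".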
Results: The average Likert rating for the selected HPO terms was 1.29 for GPT-3.5 and 1.37 for DeepL. The quality of the translations was also found to be satisfactory for multi-word terms with greater ontological depth. The comparison with HeTOP revealed a high degree of similarity between the models' translations and the reference translations.
Conclusions: Statistical analysis revealed no significant difference in mean ratings between the two models, indicating comparable translation quality. The study not only illustrates the potential of machine translation but also shows incomplete coverage of translated medical terminology. This underscores the relevance of this study for cross-lingual medical research. However, the evaluation methods need to be further refined, specific translation issues need to be addressed, and the sample size needs to be increased to allow for more generalizable conclusions.
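A significance check of the kind the conclusions refer to can be sketched with a two-sided permutation test on the difference in mean Likert ratings between the two models. The rating lists below are illustrative placeholders, not the study's data, and the permutation test stands in for whatever specific test the authors applied.

```python
import random
from statistics import mean

def permutation_test(a, b, n_iter=10_000, seed=42):
    """Two-sided permutation test for a difference in means.

    Repeatedly reshuffles the pooled ratings into two groups of the
    original sizes and counts how often the shuffled difference in
    means is at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)
        perm_a, perm_b = pooled[:len(a)], pooled[len(a):]
        if abs(mean(perm_a) - mean(perm_b)) >= observed:
            extreme += 1
    return extreme / n_iter

# Hypothetical Likert ratings (1..4) for each model
gpt_ratings = [1, 1, 2, 1, 1, 2, 1, 1, 1, 2]
deepl_ratings = [2, 1, 2, 1, 2, 2, 1, 1, 2, 1]

p = permutation_test(gpt_ratings, deepl_ratings)
print(f"p = {p:.3f}")  # a large p-value gives no evidence of a difference
```

A p-value above the usual 0.05 threshold, as in this toy run, matches the paper's finding that the two models' mean ratings are statistically indistinguishable.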
About the journal:
BMC Medical Informatics and Decision Making is an open access journal publishing original peer-reviewed research articles in relation to the design, development, implementation, use, and evaluation of health information technologies and decision-making for human health.