Pascal Block, Johanna Schaefer, Felix Maurer, Holger Storf
"Quality of Machine Translations in Medical Texts: An Analysis Based on Standardised Evaluation Metrics."
Studies in health technology and informatics, vol. 331, pp. 63-72. Published 2025-09-03. DOI: 10.3233/SHTI251380
Citations: 0
Abstract
Introduction: The medical care of patients with rare diseases is a cross-border concern across the EU. This is also reflected in the usage statistics of the SE-ATLAS, where most access occurs via browser languages set to German, English, French, or Polish. The SE-ATLAS website provides information on healthcare services and patient organisations for rare diseases in Germany. As SE-ATLAS currently offers its content almost exclusively in German, non-German-speaking users may encounter language barriers. Against this background, this paper explores whether common machine translation systems can translate medical texts into other languages at a reasonable level of quality.
Methods: For this purpose, the translation systems DeepL, ChatGPT, and Google Translate were analysed. Translation quality was assessed using the standardised metrics BLEU, METEOR, and COMET. In contrast to subjective human assessments, these automated metrics allow for objective and reproducible evaluation. The analysis focused on machine-generated translations of German-language texts from the OPUS corpus into English, French, and Polish, each compared against existing reference translations.
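To illustrate how such metrics score a candidate translation against a reference, the following is a minimal pure-Python sketch of sentence-level BLEU (clipped n-gram precision with a brevity penalty). It is a simplified illustration only; the study's reproducible scores would come from a standard, tokenisation-aware implementation such as sacreBLEU rather than this hand-rolled version.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bleu(hypothesis, reference, max_n=4):
    """Simplified sentence-level BLEU: geometric mean of clipped
    1..max_n-gram precisions, times a brevity penalty.
    Whitespace tokenisation only -- an illustrative sketch, not sacreBLEU."""
    hyp, ref = hypothesis.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = Counter(ngrams(hyp, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clipped matches: each hypothesis n-gram counts only up to
        # its frequency in the reference.
        match = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = max(len(hyp) - n + 1, 0)
        if total == 0 or match == 0:
            return 0.0  # any zero precision sends the geometric mean to 0
        log_precisions.append(math.log(match / total))
    # Brevity penalty punishes hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(sum(log_precisions) / max_n)
```

A perfect match scores 1.0, while a hypothesis sharing only partial n-gram overlap with the reference scores strictly between 0 and 1; this sensitivity to exact surface overlap is one reason BLEU tends to score lower than semantically oriented metrics like METEOR and COMET.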
Results: BLEU scores were generally lower than those of the other metrics, whereas METEOR and COMET indicated moderate to high translation quality. Translations into English were consistently rated higher than those into French and Polish.
Conclusion: As the three analysed translation systems showed hardly any statistically significant differences in translation quality and all delivered acceptable results, further criteria should be taken into account when choosing an appropriate system. These include factors such as data protection, cost-efficiency, and ease of integration.