Large Language Model Ability to Translate CT and MRI Free-Text Radiology Reports Into Multiple Languages

IF 12.1 | Zone 1 (Medicine) | JCR Q1: RADIOLOGY, NUCLEAR MEDICINE & MEDICAL IMAGING

Radiology | Pub Date: 2024-12-01 | DOI: 10.1148/radiol.241736

Aymen Meddeb, Sophia Lüken, Felix Busch, Lisa Adams, Lorenzo Ugga, Emmanouil Koltsakis, Antonios Tzortzakakis, Soumaya Jelassi, Insaf Dkhil, Michail E Klontzas, Matthaios Triantafyllou, Burak Kocak, Sabahattin Yüzkan, Longjiang Zhang, Bin Hu, Anna Andreychenko, Efimtcev Alexander Yurievich, Tatiana Logunova, Wipawee Morakote, Salita Angkurawaranon, Marcus R Makowski, Mike P Wattjes, Renato Cuocolo, Keno Bressem

Radiology, vol. 313, no. 3, e241736. Publication type: Journal Article.

Citations: 0

Abstract


Large Language Model Ability to Translate CT and MRI Free-Text Radiology Reports Into Multiple Languages.

Background: High-quality translations of radiology reports are essential for optimal patient care. Because of limited availability of human translators with medical expertise, large language models (LLMs) are a promising solution, but their ability to translate radiology reports remains largely unexplored.

Purpose: To evaluate the accuracy and quality of various LLMs in translating radiology reports across high-resource languages (English, Italian, French, German, and Chinese) and low-resource languages (Swedish, Turkish, Russian, Greek, and Thai).

Materials and Methods: A dataset of 100 synthetic free-text radiology reports from CT and MRI scans was translated by 18 radiologists between January 14 and May 2, 2024, into nine target languages. Ten LLMs, including GPT-4 (OpenAI), Llama 3 (Meta), and Mixtral models (Mistral AI), were used for automated translation. Translation accuracy and quality were assessed with use of BiLingual Evaluation Understudy (BLEU) score, translation error rate (TER), and CHaRacter-level F-score (chrF++) metrics. Statistical significance was evaluated with use of paired t tests with Holm-Bonferroni corrections. Radiologists also conducted a qualitative evaluation of translations with use of a standardized questionnaire.

Results: GPT-4 demonstrated the best overall translation quality, particularly from English to German (BLEU score: 35.0 ± 16.3 [SD]; TER: 61.7 ± 21.2; chrF++: 70.6 ± 9.4), to Greek (BLEU: 32.6 ± 10.1; TER: 52.4 ± 10.6; chrF++: 62.8 ± 6.4), to Thai (BLEU: 53.2 ± 7.3; TER: 74.3 ± 5.2; chrF++: 48.4 ± 6.6), and to Turkish (BLEU: 35.5 ± 6.6; TER: 52.7 ± 7.4; chrF++: 70.7 ± 3.7). GPT-3.5 showed highest accuracy in translations from English to French, and Qwen1.5 excelled in English-to-Chinese translations, whereas Mixtral 8x22B performed best in Italian-to-English translations. The qualitative evaluation revealed that LLMs excelled in clarity, readability, and consistency with the original meaning but showed moderate medical terminology accuracy.

Conclusion: LLMs showed high accuracy and quality for translating radiology reports, although results varied by model and language pair.

© RSNA, 2024

Supplemental material is available for this article.
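The abstract reports that significance was assessed with paired t tests followed by Holm-Bonferroni corrections. As a rough illustration of the correction step only, the sketch below applies the standard Holm step-down adjustment to a list of raw p-values. The function name `holm_bonferroni` and the example p-values are hypothetical and are not taken from the study; the authors' actual analysis code is not described in the abstract.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Holm step-down correction.

    Returns (adjusted_p_values, reject_flags), both in the input order.
    """
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])

    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (rank = k - 1) is scaled by (m - rank).
        scaled = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity so adjusted values never decrease with rank.
        running_max = max(running_max, scaled)
        adjusted[idx] = running_max

    reject = [p <= alpha for p in adjusted]
    return adjusted, reject


# Hypothetical raw p-values from four pairwise model comparisons
# (illustrative numbers only, not results from the article).
raw = [0.010, 0.040, 0.030, 0.005]
adjusted, reject = holm_bonferroni(raw)
for p, a, r in zip(raw, adjusted, reject):
    print(f"raw={p:.3f} adjusted={a:.3f} reject={r}")
```

Comparing each adjusted p-value with `alpha` gives the same decisions as the sequential form of the procedure, in which the k-th smallest raw p-value is compared with `alpha / (m - k + 1)` and testing stops at the first failure.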


Source journal

Radiology (Medicine: Nuclear Medicine)

CiteScore: 35.20

Self-citation rate: 3.00%

Articles published: 596

Review time: 3.6 months

About the journal: Published regularly since 1923 by the Radiological Society of North America (RSNA), Radiology has long been recognized as the authoritative reference for the most current, clinically relevant, and highest-quality research in the field of radiology. Each month the journal publishes approximately 240 pages of peer-reviewed original research, authoritative reviews, well-balanced commentary on significant articles, and expert opinion on new techniques and technologies. Radiology publishes cutting-edge and impactful imaging research articles in radiology and medical imaging in order to help improve human health.