Neural machine translation of clinical text: an empirical investigation into multilingual pre-trained language models and transfer-learning.

Frontiers in Digital Health · IF 3.2 · Q1 (Health Care Sciences & Services)
Pub Date: 2024-02-26 · eCollection Date: 2024-01-01 · DOI: 10.3389/fdgth.2024.1211564
Lifeng Han, Serge Gladkoff, Gleb Erofeev, Irina Sorokina, Betty Galiano, Goran Nenadic
{"title":"Neural machine translation of clinical text: an empirical investigation into multilingual pre-trained language models and transfer-learning.","authors":"Lifeng Han, Serge Gladkoff, Gleb Erofeev, Irina Sorokina, Betty Galiano, Goran Nenadic","doi":"10.3389/fdgth.2024.1211564","DOIUrl":null,"url":null,"abstract":"<p><p>Clinical text and documents contain very rich information and knowledge in healthcare, and their processing using state-of-the-art language technology becomes very important for building intelligent systems for supporting healthcare and social good. This processing includes creating language understanding models and translating resources into other natural languages to share domain-specific cross-lingual knowledge. In this work, we conduct investigations on clinical text machine translation by examining multilingual neural network models using deep learning such as Transformer based structures. Furthermore, to address the language resource imbalance issue, we also carry out experiments using a transfer learning methodology based on massive multilingual pre-trained language models (MMPLMs). The experimental results on three sub-tasks including (1) clinical case (CC), (2) clinical terminology (CT), and (3) ontological concept (OC) show that our models achieved top-level performances in the ClinSpEn-2022 shared task on English-Spanish clinical domain data. Furthermore, our expert-based human evaluations demonstrate that the small-sized pre-trained language model (PLM) outperformed the other two extra-large language models by a large margin in the clinical domain fine-tuning, which finding was never reported in the field. Finally, the transfer learning method works well in our experimental setting using the WMT21fb model to accommodate a new language space Spanish that was not seen at the pre-training stage within WMT21fb itself, which deserves more exploitation for clinical knowledge transformation, e.g. to investigate into more languages. These research findings can shed some light on domain-specific machine translation development, especially in clinical and healthcare fields. Further research projects can be carried out based on our work to improve healthcare text analytics and knowledge transformation. Our data is openly available for research purposes at: https://github.com/HECTA-UoM/ClinicalNMT.</p>","PeriodicalId":73078,"journal":{"name":"Frontiers in digital health","volume":null,"pages":null},"PeriodicalIF":3.2000,"publicationDate":"2024-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10926203/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Frontiers in digital health","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.3389/fdgth.2024.1211564","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2024/1/1 0:00:00","PubModel":"eCollection","JCR":"Q1","JCRName":"HEALTH CARE SCIENCES & SERVICES","Score":null,"Total":0}
引用次数: 0

Abstract

Clinical texts and documents contain rich healthcare information and knowledge, and processing them with state-of-the-art language technology is important for building intelligent systems that support healthcare and social good. This processing includes creating language-understanding models and translating resources into other natural languages to share domain-specific cross-lingual knowledge. In this work, we investigate clinical text machine translation by examining multilingual neural network models based on deep learning, such as Transformer-based architectures. Furthermore, to address the language-resource imbalance issue, we also carry out experiments using a transfer-learning methodology based on massive multilingual pre-trained language models (MMPLMs). The experimental results on three sub-tasks, (1) clinical case (CC), (2) clinical terminology (CT), and (3) ontological concept (OC), show that our models achieved top-level performance in the ClinSpEn-2022 shared task on English-Spanish clinical-domain data. Furthermore, our expert-based human evaluations demonstrate that the small-sized pre-trained language model (PLM) outperformed the two extra-large language models by a large margin in clinical-domain fine-tuning, a finding that had not previously been reported in the field. Finally, the transfer-learning method works well in our experimental setting, using the WMT21fb model to accommodate a new language space, Spanish, which was not seen at WMT21fb's pre-training stage; this deserves further exploration for clinical knowledge transformation, e.g., investigating more languages. These findings can shed light on domain-specific machine translation development, especially in the clinical and healthcare fields. Further research projects can build on our work to improve healthcare text analytics and knowledge transformation. Our data are openly available for research purposes at: https://github.com/HECTA-UoM/ClinicalNMT.
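To make the fine-tuning setup described above concrete, below is a minimal sketch of clinical-domain fine-tuning for English-Spanish translation, assuming the Hugging Face transformers and datasets libraries. The Marian-style checkpoint, data file name, and hyperparameters are illustrative assumptions, not the authors' exact configuration.

# Hedged sketch: fine-tune a small multilingual pre-trained translation
# model on English-Spanish clinical sentence pairs. Checkpoint, file
# paths, and hyperparameters are illustrative, not taken from the paper.
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)
from datasets import load_dataset

checkpoint = "Helsinki-NLP/opus-mt-en-es"  # small Marian PLM (assumption)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# Hypothetical parallel corpus: JSON lines with "en" and "es" fields.
raw = load_dataset("json", data_files={"train": "clinical_en_es.jsonl"})

def preprocess(batch):
    # Tokenize English source sentences and Spanish target sentences.
    model_inputs = tokenizer(batch["en"], truncation=True, max_length=512)
    labels = tokenizer(text_target=batch["es"], truncation=True, max_length=512)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = raw["train"].map(preprocess, batched=True, remove_columns=["en", "es"])

args = Seq2SeqTrainingArguments(
    output_dir="clinical-nmt-en-es",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()

# Translate a clinical sentence with the fine-tuned model.
inputs = tokenizer(
    "The patient presented with acute myocardial infarction.",
    return_tensors="pt",
).to(model.device)
print(tokenizer.batch_decode(model.generate(**inputs), skip_special_tokens=True))

In the paper's transfer-learning setting, the same fine-tuning loop would instead start from an extra-large MMPLM checkpoint such as the WMT21fb model, so that the clinical fine-tuning data introduces the Spanish language space that was unseen during pre-training.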
