{"title":"Distilling BERT knowledge into Seq2Seq with regularized Mixup for low-resource neural machine translation","authors":"Guanghua Zhang , Hua Liu , Junjun Guo , Tianyu Guo","doi":"10.1016/j.eswa.2024.125314","DOIUrl":null,"url":null,"abstract":"<div><p>Pre-trained language models, such as Bidirectional Encoder Representations from Transformers (BERT), have demonstrated state-of-the-art performance in many Natural Language Processing (NLP) downstream tasks. Incorporating pre-trained BERT knowledge into the Sequence-to-Sequence (Seq2Seq) model can significantly enhance machine translation performance, particularly for low-resource language pairs. However, most previous studies prefer to fine-tune both the large pre-trained BERT model and the Seq2Seq model jointly, leading to costly training times, especially with limited parallel data pairs. Consequently, the integration of pre-trained BERT contextual representations into the Seq2Seq framework is limited. In this paper, we propose a simple and effective BERT knowledge fusion approach based on regularized Mixup for low-resource Neural Machine Translation (NMT), referred to as ReMixup-NMT, which constrains the distributions of the normal Transformer encoder and the Mixup-based Transformer encoder to be consistent. The proposed ReMixup NMT approach is able to distill and fuse the pre-trained BERT knowledge into Seq2Seq NMT architecture in an efficient manner with non-additional parameters training. Experiment results on six low-resource NMT tasks show the proposed approach outperforms the state-of-the-art (SOTA) BERT-fused and drop-based methods on IWSLT’15 English<span><math><mo>→</mo></math></span>Vietnamese and IWSLT’17 English<span><math><mo>→</mo></math></span>French datasets.</p></div>","PeriodicalId":50461,"journal":{"name":"Expert Systems with Applications","volume":"259 ","pages":"Article 125314"},"PeriodicalIF":7.5000,"publicationDate":"2024-09-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Expert Systems with Applications","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S095741742402181X","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0
Abstract
Pre-trained language models, such as Bidirectional Encoder Representations from Transformers (BERT), have demonstrated state-of-the-art performance on many downstream Natural Language Processing (NLP) tasks. Incorporating pre-trained BERT knowledge into a Sequence-to-Sequence (Seq2Seq) model can significantly improve machine translation performance, particularly for low-resource language pairs. However, most previous studies fine-tune the large pre-trained BERT model and the Seq2Seq model jointly, which incurs costly training times, especially when parallel data is limited. Consequently, the integration of pre-trained BERT contextual representations into the Seq2Seq framework remains limited. In this paper, we propose a simple and effective BERT knowledge-fusion approach based on regularized Mixup for low-resource Neural Machine Translation (NMT), referred to as ReMixup-NMT, which constrains the output distributions of the normal Transformer encoder and the Mixup-based Transformer encoder to be consistent. The proposed ReMixup-NMT approach distills and fuses pre-trained BERT knowledge into the Seq2Seq NMT architecture efficiently, without training additional parameters. Experimental results on six low-resource NMT tasks show that the proposed approach outperforms state-of-the-art (SOTA) BERT-fused and drop-based methods on the IWSLT'15 English→Vietnamese and IWSLT'17 English→French datasets.
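The abstract describes the method only at a high level. As a rough illustration of the consistency-regularization idea it outlines, the snippet below mixes frozen BERT states into the Seq2Seq encoder states via Mixup and penalizes divergence between the decoder distributions produced by the two encoder paths. This is a minimal sketch under stated assumptions: the function names (`mixup_bert_fusion`, `consistency_loss`), the Beta-sampled mixing coefficient, and the symmetric-KL form of the regularizer are illustrative choices, not the paper's exact formulation.

```python
# Hypothetical sketch of Mixup-based BERT fusion with a consistency
# regularizer, in the spirit of the abstract; not the paper's exact method.
import torch
import torch.nn.functional as F
from torch.distributions import Beta

def mixup_bert_fusion(enc_hidden: torch.Tensor,
                      bert_hidden: torch.Tensor,
                      alpha: float = 0.4) -> torch.Tensor:
    """Interpolate Seq2Seq encoder states with (frozen) BERT states.

    Both tensors are assumed to share shape (batch, seq_len, d_model);
    the Beta(alpha, alpha) mixing coefficient follows standard Mixup.
    """
    lam = Beta(alpha, alpha).sample().to(enc_hidden.device)
    return lam * enc_hidden + (1.0 - lam) * bert_hidden

def consistency_loss(logits_normal: torch.Tensor,
                     logits_mixup: torch.Tensor) -> torch.Tensor:
    """Symmetric KL divergence between the decoder output distributions
    produced from the normal and the Mixup-based encoder, encouraging
    the two paths to stay consistent (the 'regularized' part)."""
    log_p = F.log_softmax(logits_normal, dim=-1)
    log_q = F.log_softmax(logits_mixup, dim=-1)
    kl_pq = F.kl_div(log_q, log_p, log_target=True, reduction="batchmean")
    kl_qp = F.kl_div(log_p, log_q, log_target=True, reduction="batchmean")
    return 0.5 * (kl_pq + kl_qp)
```

In training, a term like this would typically be added to the standard cross-entropy translation loss with a weighting hyperparameter; since the BERT encoder stays frozen in this sketch, no additional parameters are trained, consistent with the abstract's claim.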
Journal Introduction:
Expert Systems With Applications is an international journal dedicated to the exchange of information on expert and intelligent systems used globally in industry, government, and universities. The journal emphasizes original papers covering the design, development, testing, implementation, and management of these systems, offering practical guidelines. It spans various sectors such as finance, engineering, marketing, law, project management, information management, medicine, and more. The journal also welcomes papers on multi-agent systems, knowledge management, neural networks, knowledge discovery, data mining, and other related areas, excluding applications to military/defense systems.