Data augmentation approaches in natural language processing: A survey

IF 14.8

AI Open, Pub Date: 2022-01-01, DOI: 10.1016/j.aiopen.2022.03.001

Bohan Li, Yutai Hou, Wanxiang Che

Citations: 118

Abstract

Data augmentation (DA) is an effective strategy for alleviating data scarcity, a setting in which deep learning techniques may fail. It was first widely applied in computer vision and was later introduced to natural language processing (NLP), where it has improved performance on many tasks. A main focus of DA methods is to increase the diversity of training data, which helps the model generalize better to unseen test data. In this survey, we group DA methods into three categories based on the diversity of the augmented data: paraphrasing, noising, and sampling. We analyze DA methods in detail according to these categories. We also introduce their applications in NLP tasks and discuss the remaining challenges. Some useful resources are provided in Appendix A.
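To make the noising category concrete, the following is a minimal sketch of two commonly used token-level noising operations, random deletion and random swap. It is an illustration only, not the survey's implementation: it assumes whitespace tokenization, and the function names (`random_deletion`, `random_swap`, `augment`) and default parameters are hypothetical choices made for this example.

```python
import random


def random_deletion(tokens, p, rng):
    """Drop each token with probability p, keeping at least one token."""
    if not tokens:
        return []
    kept = [t for t in tokens if rng.random() >= p]
    # If everything was dropped, fall back to a single random token.
    return kept if kept else [rng.choice(tokens)]


def random_swap(tokens, n_swaps, rng):
    """Swap two distinct, randomly chosen positions, n_swaps times."""
    out = list(tokens)
    if len(out) < 2:
        return out
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out


def augment(sentence, n_aug=4, p_delete=0.1, n_swaps=1, seed=0):
    """Return n_aug noised variants, alternating deletion and swap.

    Assumption: whitespace tokenization is good enough for illustration.
    """
    rng = random.Random(seed)  # seeded so the variants are reproducible
    tokens = sentence.split()
    variants = []
    for k in range(n_aug):
        if k % 2 == 0:
            noised = random_deletion(tokens, p_delete, rng)
        else:
            noised = random_swap(tokens, n_swaps, rng)
        variants.append(" ".join(noised))
    return variants


if __name__ == "__main__":
    for variant in augment("data augmentation helps models generalize to unseen data"):
        print(variant)
```

Each variant keeps the original label in a typical classification setup, so small noise rates are usually preferred; paraphrasing and sampling methods generally require an external model or resource and are not sketched here.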
