On the Impact of Data Augmentation on Downstream Performance in Natural Language Processing

Itsuki Okimura, Machel Reid, M. Kawano, Yutaka Matsuo
{"title":"On the Impact of Data Augmentation on Downstream Performance in Natural Language Processing","authors":"Itsuki Okimura, Machel Reid, M. Kawano, Yutaka Matsuo","doi":"10.18653/v1/2022.insights-1.12","DOIUrl":null,"url":null,"abstract":"With in the broader scope of machine learning, data augmentation is a common strategy to improve generalization and robustness of machine learning models. While data augmentation has been widely used within computer vision, its use in the NLP has been been comparably rather limited. The reason for this is that within NLP, the impact of proposed data augmentation methods on performance has not been evaluated in a unified manner, and effective data augmentation methods are unclear. In this paper, we look to tackle this by evaluating the impact of 12 data augmentation methods on multiple datasets when finetuning pre-trained language models. We find minimal improvements when data sizes are constrained to a few thousand, with performance degradation when data size is increased. We also use various methods to quantify the strength of data augmentations, and find that these values, though weakly correlated with downstream performance, correlate negatively or positively depending on the task.Furthermore, we find a glaring lack of consistently performant data augmentations. This all alludes to the difficulty of data augmentations for NLP tasks and we are inclined to believe that static data augmentations are not broadly applicable given these properties.","PeriodicalId":441528,"journal":{"name":"First Workshop on Insights from Negative Results in NLP","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"First Workshop on Insights from Negative Results in NLP","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.18653/v1/2022.insights-1.12","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 6

Abstract

Within the broader scope of machine learning, data augmentation is a common strategy to improve the generalization and robustness of machine learning models. While data augmentation has been widely used within computer vision, its use in NLP has been comparably limited. The reason for this is that within NLP, the impact of proposed data augmentation methods on performance has not been evaluated in a unified manner, and effective data augmentation methods remain unclear. In this paper, we look to tackle this by evaluating the impact of 12 data augmentation methods on multiple datasets when finetuning pre-trained language models. We find minimal improvements when data sizes are constrained to a few thousand examples, with performance degradation when data size is increased. We also use various methods to quantify the strength of data augmentations, and find that these values, though weakly correlated with downstream performance, correlate negatively or positively depending on the task. Furthermore, we find a glaring lack of consistently performant data augmentations. This all alludes to the difficulty of data augmentation for NLP tasks, and we are inclined to believe that static data augmentations are not broadly applicable given these properties.
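The abstract refers to 12 data augmentation methods and to quantifying their "strength", but does not spell out the implementations. As a minimal illustrative sketch only (not the paper's actual methods), the snippet below shows two simple token-level text augmentations in the spirit of EDA-style approaches, random deletion and random swap, plus one naive way to score how strongly an augmentation perturbs the input. The function names and the character-similarity-based strength metric are assumptions introduced here for illustration.

```python
import random
from difflib import SequenceMatcher


def random_deletion(tokens, p=0.1):
    """Drop each token with probability p, keeping at least one token."""
    kept = [t for t in tokens if random.random() > p]
    return kept if kept else [random.choice(tokens)]


def random_swap(tokens, n=1):
    """Swap the positions of two randomly chosen tokens, n times."""
    tokens = tokens[:]
    for _ in range(n):
        if len(tokens) < 2:
            break
        i, j = random.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens


def augmentation_strength(original, augmented):
    """Crude 'strength' score: 1 minus character-level similarity ratio."""
    return 1.0 - SequenceMatcher(None, original, augmented).ratio()


if __name__ == "__main__":
    random.seed(0)
    sentence = "data augmentation can improve generalization on small datasets"
    tokens = sentence.split()
    augmented = " ".join(random_swap(random_deletion(tokens, p=0.15), n=1))
    print("original :", sentence)
    print("augmented:", augmented)
    print("strength :", round(augmentation_strength(sentence, augmented), 3))
```

In a study like the one described, augmented copies of training examples would be mixed into the finetuning data, and strength scores such as the one above could then be correlated with the resulting change in downstream performance; the specific augmentations, strength measures, and correlation analysis used by the authors are given in the full paper, not here.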