自然语言处理中的数据扩充方法:一项调查

Bohan Li, Yutai Hou, Wanxiang Che
{"title":"自然语言处理中的数据扩充方法:一项调查","authors":"Bohan Li,&nbsp;Yutai Hou,&nbsp;Wanxiang Che","doi":"10.1016/j.aiopen.2022.03.001","DOIUrl":null,"url":null,"abstract":"<div><p>As an effective strategy, data augmentation (DA) alleviates data scarcity scenarios where deep learning techniques may fail. It is widely applied in computer vision then introduced to natural language processing and achieves improvements in many tasks. One of the main focuses of the DA methods is to improve the diversity of training data, thereby helping the model to better generalize to unseen testing data. In this survey, we frame DA methods into three categories based on the <strong>diversity</strong> of augmented data, including paraphrasing, noising, and sampling. Our paper sets out to analyze DA methods in detail according to the above categories. Further, we also introduce their applications in NLP tasks as well as the challenges. Some useful resources are provided in Appendix A.</p></div>","PeriodicalId":100068,"journal":{"name":"AI Open","volume":"3 ","pages":"Pages 71-90"},"PeriodicalIF":0.0000,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S2666651022000080/pdfft?md5=daaa1ffcdc6b6cb892dcef9a8b3ee29a&pid=1-s2.0-S2666651022000080-main.pdf","citationCount":"118","resultStr":"{\"title\":\"Data augmentation approaches in natural language processing: A survey\",\"authors\":\"Bohan Li,&nbsp;Yutai Hou,&nbsp;Wanxiang Che\",\"doi\":\"10.1016/j.aiopen.2022.03.001\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><p>As an effective strategy, data augmentation (DA) alleviates data scarcity scenarios where deep learning techniques may fail. It is widely applied in computer vision then introduced to natural language processing and achieves improvements in many tasks. One of the main focuses of the DA methods is to improve the diversity of training data, thereby helping the model to better generalize to unseen testing data. In this survey, we frame DA methods into three categories based on the <strong>diversity</strong> of augmented data, including paraphrasing, noising, and sampling. Our paper sets out to analyze DA methods in detail according to the above categories. Further, we also introduce their applications in NLP tasks as well as the challenges. Some useful resources are provided in Appendix A.</p></div>\",\"PeriodicalId\":100068,\"journal\":{\"name\":\"AI Open\",\"volume\":\"3 \",\"pages\":\"Pages 71-90\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-01-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.sciencedirect.com/science/article/pii/S2666651022000080/pdfft?md5=daaa1ffcdc6b6cb892dcef9a8b3ee29a&pid=1-s2.0-S2666651022000080-main.pdf\",\"citationCount\":\"118\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"AI Open\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S2666651022000080\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"AI Open","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2666651022000080","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 118

摘要

作为一种有效的策略,数据增强(DA)可以缓解深度学习技术可能失败的数据短缺情况。它被广泛应用于计算机视觉,然后被引入到自然语言处理中,并在许多任务中实现了改进。DA方法的主要焦点之一是提高训练数据的多样性,从而帮助模型更好地推广到看不见的测试数据。在这项调查中,我们根据扩增数据的多样性将DA方法分为三类,包括转述、噪声和采样。我们的论文开始根据以上类别详细分析DA方法。此外,我们还介绍了它们在NLP任务中的应用以及面临的挑战。附录A提供了一些有用的资源。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
Data augmentation approaches in natural language processing: A survey

As an effective strategy, data augmentation (DA) alleviates data scarcity scenarios where deep learning techniques may fail. It is widely applied in computer vision then introduced to natural language processing and achieves improvements in many tasks. One of the main focuses of the DA methods is to improve the diversity of training data, thereby helping the model to better generalize to unseen testing data. In this survey, we frame DA methods into three categories based on the diversity of augmented data, including paraphrasing, noising, and sampling. Our paper sets out to analyze DA methods in detail according to the above categories. Further, we also introduce their applications in NLP tasks as well as the challenges. Some useful resources are provided in Appendix A.

求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
CiteScore
45.00
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信