A survey of data augmentation in named entity recognition

IF 5.5 · Zone 2 (Computer Science) · Q1 (Computer Science, Artificial Intelligence)

Neurocomputing, Pub Date: 2025-07-10, DOI: 10.1016/j.neucom.2025.130856

Yi Huang, Yuhan Gao, Chengjuan Ren

Neurocomputing, Volume 651, Article 130856. Journal article, not open access. https://www.sciencedirect.com/science/article/pii/S0925231225015280

Citations: 0

Abstract

Data augmentation (DA), initially prominent in Computer Vision (CV), has been successfully adapted to Natural Language Processing (NLP), where it has proved effective in mitigating data scarcity in few-shot settings and in scenarios where deep learning techniques may underperform. The primary goal of DA is to expand and diversify training datasets through different methods, enabling models to generate more diverse, high-quality synthetic data for training NER models. This survey explores DA techniques in the context of Named Entity Recognition (NER), covering linguistic features and four categories of data augmentation methods. We also review commonly used datasets in DA tasks, discuss potential practical applications, and examine key challenges and future directions in DA for NER. These findings serve as a reference for learners and offer insights for researchers. As an essential and cost-effective approach, DA alleviates data scarcity and overfitting in NER models by facilitating the integration of diverse augmentation methods.
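To make the idea of expanding an NER training set concrete, here is a minimal sketch of one widely used augmentation, mention replacement: an entity mention is swapped for another mention of the same type drawn from the training data, and the labels are regenerated to match. The abstract does not say which methods the survey's four categories contain, so this is only an illustration, not the authors' method. The BIO tagging scheme, the function names, and the replacement probability `p` are all assumptions made for the example.

```python
import random
from collections import defaultdict


def extract_mentions(tags):
    """Return (start, end, type) spans from BIO tags; end is exclusive."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing span
        if tag.startswith("I-") and start is not None and tag[2:] == etype:
            continue  # still inside the current mention
        if start is not None:
            spans.append((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


def build_mention_pool(dataset):
    """Collect every mention in the dataset, grouped by entity type."""
    pool = defaultdict(list)
    for tokens, tags in dataset:
        for s, e, t in extract_mentions(tags):
            pool[t].append(tokens[s:e])
    return pool


def replace_mentions(tokens, tags, pool, rng, p=0.5):
    """Swap each mention, with probability p, for a same-type mention from pool."""
    new_tokens, new_tags, cursor = [], [], 0
    for s, e, t in extract_mentions(tags):
        # Copy the non-entity tokens that precede this mention unchanged.
        new_tokens += tokens[cursor:s]
        new_tags += tags[cursor:s]
        candidates = pool.get(t)
        if candidates and rng.random() < p:
            mention = rng.choice(candidates)
        else:
            mention = tokens[s:e]
        new_tokens += mention
        # Regenerate BIO labels so they match the new mention length.
        new_tags += ["B-" + t] + ["I-" + t] * (len(mention) - 1)
        cursor = e
    new_tokens += tokens[cursor:]
    new_tags += tags[cursor:]
    return new_tokens, new_tags
```

A typical use would build the pool once with `build_mention_pool(train_set)` and then call `replace_mentions(tokens, tags, pool, random.Random(seed))` on each sentence, adding the results to the training data. Because the labels are rebuilt from the replacement mention, tokens and tags stay aligned even when the new mention has a different length.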




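Source journal

Neurocomputing (Engineering & Technology, Computer Science: Artificial Intelligence)

CiteScore: 13.10

Self-citation rate: 10.00%

Articles published: 1382

Review time: 70 days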

About the journal: Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing. Neurocomputing theory, practice and applications are the essential topics being covered.