A survey of data augmentation in named entity recognition
Authors: Yi Huang, Yuhan Gao, Chengjuan Ren
Journal: Neurocomputing, volume 651, Article 130856 (JCR Q1, Computer Science, Artificial Intelligence; impact factor 5.5)
Publication date: 2025-07-10
DOI: 10.1016/j.neucom.2025.130856
URL: https://www.sciencedirect.com/science/article/pii/S0925231225015280
Citation count: 0
Abstract
Data augmentation (DA), initially prominent in Computer Vision (CV), has been successfully adapted to Natural Language Processing (NLP), where it has proved effective in mitigating data scarcity in few-shot settings and in other scenarios where deep learning techniques may underperform. The primary goal of DA is to expand and diversify training datasets through a range of methods, yielding more diverse, high-quality synthetic data for training NER models. This survey explores DA techniques in the context of Named Entity Recognition (NER), covering linguistic features and four categories of data augmentation methods. We also review datasets commonly used in DA tasks, discuss potential practical applications, and examine key challenges and future directions in DA for NER. These findings serve as a reference for learners and offer insights for researchers. As an essential and cost-effective approach, DA alleviates data scarcity and overfitting in NER models by facilitating the integration of diverse augmentation methods.
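To make the idea of expanding an NER training set concrete, the following is a minimal sketch of one widely used label-preserving augmentation, mention replacement, in Python. It is an illustration only: the abstract does not name the survey's four categories, so this is not presented as one of them. The BIO tagging scheme, the function names `extract_mentions` and `replace_mentions`, and the toy `lexicon` are all assumptions introduced here.

```python
import random


def extract_mentions(tokens, tags):
    """Collect entity mentions from BIO tags as (start, end, type) spans."""
    spans, start = [], None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing entity
        if start is not None and not tag.startswith("I-"):
            spans.append((start, i, tags[start][2:]))
            start = None
        if tag.startswith("B-"):
            start = i
    return spans


def replace_mentions(tokens, tags, lexicon, rng):
    """Swap each mention for a random same-type mention from the lexicon.

    `lexicon` (hypothetical) maps an entity type to a list of token lists.
    Tags are regenerated so they stay aligned with the new tokens.
    """
    new_tokens, new_tags, cursor = [], [], 0
    for start, end, etype in extract_mentions(tokens, tags):
        # Copy the non-entity context unchanged.
        new_tokens += tokens[cursor:start]
        new_tags += tags[cursor:start]
        candidates = lexicon.get(etype)
        # Keep the original mention when no replacement is available.
        replacement = rng.choice(candidates) if candidates else tokens[start:end]
        new_tokens += replacement
        new_tags += [f"B-{etype}"] + [f"I-{etype}"] * (len(replacement) - 1)
        cursor = end
    new_tokens += tokens[cursor:]
    new_tags += tags[cursor:]
    return new_tokens, new_tags


if __name__ == "__main__":
    tokens = ["Marie", "Curie", "worked", "in", "Warsaw"]
    tags = ["B-PER", "I-PER", "O", "O", "B-LOC"]
    lexicon = {
        "PER": [["Alan", "Turing"], ["Ada", "Lovelace"]],
        "LOC": [["Paris"], ["New", "York"]],
    }
    rng = random.Random(0)  # seeded for reproducibility
    for _ in range(3):
        print(replace_mentions(tokens, tags, lexicon, rng))
```

Each call produces a new sentence whose context words are unchanged and whose entity labels remain valid, which is the sense in which such augmentation adds synthetic but still correctly annotated training examples.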
About the journal:
Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing. Neurocomputing theory, practice and applications are the essential topics being covered.