Title: AugGPT: Leveraging ChatGPT for Text Data Augmentation
Authors: Haixing Dai, Zhengliang Liu, Wenxiong Liao, Xiaoke Huang, Yihan Cao, Zihao Wu, Lin Zhao, Shaochen Xu, Fang Zeng, Wei Liu, Ninghao Liu, Sheng Li, Dajiang Zhu, Hongmin Cai, Lichao Sun, Quanzheng Li, Dinggang Shen, Tianming Liu, Xiang Li
DOI: 10.1109/TBDATA.2025.3536934
Journal: IEEE Transactions on Big Data, vol. 11, no. 3, pp. 907-918 (JCR Q1, Computer Science, Information Systems; impact factor 7.5)
Published: 2025-01-30 (Journal Article)
URL: https://ieeexplore.ieee.org/document/10858342/
Citations: 0
Abstract
Text data augmentation is an effective strategy for overcoming the challenge of limited sample sizes in many natural language processing (NLP) tasks. This challenge is especially prominent in the few-shot learning (FSL) scenario, where the data in the target domain is generally much scarcer and of lower quality. A natural and widely used strategy to mitigate such challenges is to perform data augmentation to better capture data invariance and increase the sample size. However, current text data augmentation methods either cannot ensure the correct labeling of the generated data (lacking faithfulness), or cannot ensure sufficient diversity in the generated data (lacking compactness), or both. Inspired by the recent success of large language models (LLMs), especially the development of ChatGPT, we propose a text data augmentation approach based on ChatGPT (named "AugGPT"). AugGPT rephrases each sentence in the training samples into multiple conceptually similar but semantically different samples. The augmented samples can then be used in downstream model training. Experimental results on multiple few-shot learning text classification tasks show the superior performance of the proposed AugGPT approach over state-of-the-art text data augmentation methods in terms of testing accuracy and distribution of the augmented samples.
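The core idea described in the abstract — prompting ChatGPT to rephrase each labeled training sentence into several label-preserving paraphrases, then adding them to the training set — can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the prompt wording, the `n_variants` parameter, and the injected `chat` callable (a thin wrapper around any LLM chat API) are assumptions for illustration only.

```python
def build_prompt(sentence, n_variants=6):
    # Ask the model for n_variants rephrasings that keep the original
    # meaning (and hence the class label) while varying the surface form.
    # The exact prompt used by AugGPT is described in the paper itself.
    return (
        f"Rephrase the following sentence {n_variants} times, keeping the "
        f"same meaning but using different wording. Number each rephrasing.\n"
        f"Sentence: {sentence}"
    )

def parse_variants(reply):
    # Extract the numbered rephrasings ("1. ...", "2. ...") from the reply.
    variants = []
    for line in reply.splitlines():
        line = line.strip()
        if line and line[0].isdigit():
            variants.append(line.split(".", 1)[-1].strip())
    return variants

def augment(dataset, chat, n_variants=6):
    # dataset: list of (sentence, label) pairs.
    # chat: callable mapping a prompt string to a reply string; injected so
    # it can be backed by any LLM API or mocked in tests.
    augmented = list(dataset)
    for sentence, label in dataset:
        reply = chat(build_prompt(sentence, n_variants))
        # Each paraphrase inherits the label of its source sentence.
        augmented.extend((v, label) for v in parse_variants(reply))
    return augmented
```

Injecting the `chat` callable keeps the augmentation loop independent of any particular API client; the augmented list can be fed directly to a downstream few-shot classifier's training routine.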
About the journal:
The IEEE Transactions on Big Data publishes peer-reviewed articles focusing on big data. These articles present innovative research ideas and application results across disciplines, including novel theories, algorithms, and applications. Research areas cover a wide range, such as big data analytics, visualization, curation, management, semantics, infrastructure, standards, performance analysis, intelligence extraction, scientific discovery, security, privacy, and legal issues specific to big data. The journal also prioritizes applications of big data in fields generating massive datasets.