Performance Evaluation of Text Augmentation Methods with BERT on Small-sized, Imbalanced Datasets

Lingshu Hu, Can Li, Wenbo Wang, Bin Pang, Yi Shang
{"title":"基于BERT的文本增强方法在小型不平衡数据集上的性能评价","authors":"Lingshu Hu, Can Li, Wenbo Wang, Bin Pang, Yi Shang","doi":"10.1109/CogMI56440.2022.00027","DOIUrl":null,"url":null,"abstract":"Recently deep learning methods have achieved great success in understanding and analyzing text messages. In real-world applications, however, labeled text data are often small-sized and imbalanced in classes due to the high cost of data collection and human annotation, limiting the performance of deep learning classifiers. Therefore, this study explores an understudied area—how sample sizes and imbalance ratios influence the performance of deep learning models and augmentation methods—and provides a solution to this problem. Specifically, this study examines the performance of BERT, Word2Vec, and WordNet augmentation methods with BERT fine-tuning on datasets of sizes 500, 1,000, and 2,000 and imbalance ratios of 4:1 and 9:1. Experimental results show that BERT augmentation improves the performance of BERT in detecting the minority class, and the improvement is most significantly (15.6–40.4% F1 increase compared to the base model and 2.8%–10.4% F1 increase compared to the model with the oversampling method) when the data size is small (e.g., 500 training documents) and highly imbalanced (e.g., 9:1). When the data size increases or the imbalance ratio decreases, the improvement generated by the BERT augmentation becomes smaller or insignificant. Moreover, BERT augmentation plus BERT fine-tuning achieves the best performance compared to other models and methods, demonstrating a promising solution for small-sized, highly imbalanced text classification tasks.","PeriodicalId":211430,"journal":{"name":"2022 IEEE 4th International Conference on Cognitive Machine Intelligence (CogMI)","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2022-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Performance Evaluation of Text Augmentation Methods with BERT on Small-sized, Imbalanced Datasets\",\"authors\":\"Lingshu Hu, Can Li, Wenbo Wang, Bin Pang, Yi Shang\",\"doi\":\"10.1109/CogMI56440.2022.00027\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Recently deep learning methods have achieved great success in understanding and analyzing text messages. In real-world applications, however, labeled text data are often small-sized and imbalanced in classes due to the high cost of data collection and human annotation, limiting the performance of deep learning classifiers. Therefore, this study explores an understudied area—how sample sizes and imbalance ratios influence the performance of deep learning models and augmentation methods—and provides a solution to this problem. Specifically, this study examines the performance of BERT, Word2Vec, and WordNet augmentation methods with BERT fine-tuning on datasets of sizes 500, 1,000, and 2,000 and imbalance ratios of 4:1 and 9:1. Experimental results show that BERT augmentation improves the performance of BERT in detecting the minority class, and the improvement is most significantly (15.6–40.4% F1 increase compared to the base model and 2.8%–10.4% F1 increase compared to the model with the oversampling method) when the data size is small (e.g., 500 training documents) and highly imbalanced (e.g., 9:1). When the data size increases or the imbalance ratio decreases, the improvement generated by the BERT augmentation becomes smaller or insignificant. 
Moreover, BERT augmentation plus BERT fine-tuning achieves the best performance compared to other models and methods, demonstrating a promising solution for small-sized, highly imbalanced text classification tasks.\",\"PeriodicalId\":211430,\"journal\":{\"name\":\"2022 IEEE 4th International Conference on Cognitive Machine Intelligence (CogMI)\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-12-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2022 IEEE 4th International Conference on Cognitive Machine Intelligence (CogMI)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/CogMI56440.2022.00027\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 IEEE 4th International Conference on Cognitive Machine Intelligence (CogMI)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CogMI56440.2022.00027","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Recently, deep learning methods have achieved great success in understanding and analyzing text messages. In real-world applications, however, labeled text data are often small-sized and imbalanced in classes due to the high cost of data collection and human annotation, limiting the performance of deep learning classifiers. Therefore, this study explores an understudied area, namely how sample sizes and imbalance ratios influence the performance of deep learning models and augmentation methods, and provides a solution to this problem. Specifically, this study examines the performance of BERT, Word2Vec, and WordNet augmentation methods with BERT fine-tuning on datasets of sizes 500, 1,000, and 2,000 and imbalance ratios of 4:1 and 9:1. Experimental results show that BERT augmentation improves the performance of BERT in detecting the minority class, and the improvement is most significant (15.6-40.4% F1 increase over the base model and 2.8%-10.4% F1 increase over the model with the oversampling method) when the data size is small (e.g., 500 training documents) and highly imbalanced (e.g., 9:1). When the data size increases or the imbalance ratio decreases, the improvement generated by BERT augmentation becomes smaller or insignificant. Moreover, BERT augmentation plus BERT fine-tuning achieves the best performance compared to other models and methods, demonstrating a promising solution for small-sized, highly imbalanced text classification tasks.
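The core idea of BERT augmentation is to create additional minority-class training documents by masking words in an existing document and letting a masked language model propose replacements. The sketch below is a minimal illustration of that idea using the Hugging Face transformers fill-mask pipeline; the model name, mask rate, and balancing loop are illustrative assumptions, not the authors' exact setup.

```python
# Minimal sketch of BERT-based text augmentation for a minority class.
# Assumes the Hugging Face `transformers` library; the model name and
# mask rate are illustrative choices, not the paper's configuration.
import random
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
MASK = fill_mask.tokenizer.mask_token  # "[MASK]" for BERT

def bert_augment(text: str, mask_rate: float = 0.15) -> str:
    """Randomly mask words and replace each with BERT's top prediction."""
    words = text.split()
    for i in range(len(words)):
        if random.random() < mask_rate:
            masked = " ".join(words[:i] + [MASK] + words[i + 1:])
            words[i] = fill_mask(masked, top_k=1)[0]["token_str"]
    return " ".join(words)

# Hypothetical balancing loop: oversample the minority class with
# augmented copies until it roughly matches the majority class size.
def balance(minority_docs, majority_count):
    augmented = list(minority_docs)
    while len(augmented) < majority_count:
        augmented.append(bert_augment(random.choice(minority_docs)))
    return augmented
```

Documents generated this way would be added to the minority class before fine-tuning the BERT classifier, in contrast to plain oversampling, which only duplicates existing minority-class documents.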