Performance Evaluation of Text Augmentation Methods with BERT on Small-sized, Imbalanced Datasets

Lingshu Hu, Can Li, Wenbo Wang, Bin Pang, Yi Shang
{"title":"基于BERT的文本增强方法在小型不平衡数据集上的性能评价","authors":"Lingshu Hu, Can Li, Wenbo Wang, Bin Pang, Yi Shang","doi":"10.1109/CogMI56440.2022.00027","DOIUrl":null,"url":null,"abstract":"Recently deep learning methods have achieved great success in understanding and analyzing text messages. In real-world applications, however, labeled text data are often small-sized and imbalanced in classes due to the high cost of data collection and human annotation, limiting the performance of deep learning classifiers. Therefore, this study explores an understudied area—how sample sizes and imbalance ratios influence the performance of deep learning models and augmentation methods—and provides a solution to this problem. Specifically, this study examines the performance of BERT, Word2Vec, and WordNet augmentation methods with BERT fine-tuning on datasets of sizes 500, 1,000, and 2,000 and imbalance ratios of 4:1 and 9:1. Experimental results show that BERT augmentation improves the performance of BERT in detecting the minority class, and the improvement is most significantly (15.6–40.4% F1 increase compared to the base model and 2.8%–10.4% F1 increase compared to the model with the oversampling method) when the data size is small (e.g., 500 training documents) and highly imbalanced (e.g., 9:1). When the data size increases or the imbalance ratio decreases, the improvement generated by the BERT augmentation becomes smaller or insignificant. Moreover, BERT augmentation plus BERT fine-tuning achieves the best performance compared to other models and methods, demonstrating a promising solution for small-sized, highly imbalanced text classification tasks.","PeriodicalId":211430,"journal":{"name":"2022 IEEE 4th International Conference on Cognitive Machine Intelligence (CogMI)","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2022-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Performance Evaluation of Text Augmentation Methods with BERT on Small-sized, Imbalanced Datasets\",\"authors\":\"Lingshu Hu, Can Li, Wenbo Wang, Bin Pang, Yi Shang\",\"doi\":\"10.1109/CogMI56440.2022.00027\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Recently deep learning methods have achieved great success in understanding and analyzing text messages. In real-world applications, however, labeled text data are often small-sized and imbalanced in classes due to the high cost of data collection and human annotation, limiting the performance of deep learning classifiers. Therefore, this study explores an understudied area—how sample sizes and imbalance ratios influence the performance of deep learning models and augmentation methods—and provides a solution to this problem. Specifically, this study examines the performance of BERT, Word2Vec, and WordNet augmentation methods with BERT fine-tuning on datasets of sizes 500, 1,000, and 2,000 and imbalance ratios of 4:1 and 9:1. Experimental results show that BERT augmentation improves the performance of BERT in detecting the minority class, and the improvement is most significantly (15.6–40.4% F1 increase compared to the base model and 2.8%–10.4% F1 increase compared to the model with the oversampling method) when the data size is small (e.g., 500 training documents) and highly imbalanced (e.g., 9:1). When the data size increases or the imbalance ratio decreases, the improvement generated by the BERT augmentation becomes smaller or insignificant. 
Moreover, BERT augmentation plus BERT fine-tuning achieves the best performance compared to other models and methods, demonstrating a promising solution for small-sized, highly imbalanced text classification tasks.\",\"PeriodicalId\":211430,\"journal\":{\"name\":\"2022 IEEE 4th International Conference on Cognitive Machine Intelligence (CogMI)\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-12-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2022 IEEE 4th International Conference on Cognitive Machine Intelligence (CogMI)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/CogMI56440.2022.00027\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 IEEE 4th International Conference on Cognitive Machine Intelligence (CogMI)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CogMI56440.2022.00027","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Recently, deep learning methods have achieved great success in understanding and analyzing text messages. In real-world applications, however, labeled text data are often small-sized and imbalanced in classes due to the high cost of data collection and human annotation, limiting the performance of deep learning classifiers. Therefore, this study explores an understudied area, namely how sample sizes and imbalance ratios influence the performance of deep learning models and augmentation methods, and provides a solution to this problem. Specifically, this study examines the performance of BERT, Word2Vec, and WordNet augmentation methods with BERT fine-tuning on datasets of sizes 500, 1,000, and 2,000 and imbalance ratios of 4:1 and 9:1. Experimental results show that BERT augmentation improves the performance of BERT in detecting the minority class, and the improvement is most significant (15.6-40.4% F1 increase over the base model and 2.8%-10.4% F1 increase over the model with the oversampling method) when the data size is small (e.g., 500 training documents) and highly imbalanced (e.g., 9:1). When the data size increases or the imbalance ratio decreases, the improvement generated by BERT augmentation becomes smaller or insignificant. Moreover, BERT augmentation plus BERT fine-tuning achieves the best performance compared to other models and methods, demonstrating a promising solution for small-sized, highly imbalanced text classification tasks.
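The core idea of BERT augmentation is to create additional minority-class training documents by masking words in an existing document and letting a masked language model propose replacements. The sketch below is a minimal illustration of that idea using the Hugging Face transformers fill-mask pipeline; the model name, mask rate, and balancing loop are illustrative assumptions, not the authors' exact setup.

```python
# Minimal sketch of BERT-based text augmentation for a minority class.
# Assumes the Hugging Face `transformers` library; the model name and
# mask rate are illustrative choices, not the paper's configuration.
import random
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
MASK = fill_mask.tokenizer.mask_token  # "[MASK]" for BERT

def bert_augment(text: str, mask_rate: float = 0.15) -> str:
    """Randomly mask words and replace each with BERT's top prediction."""
    words = text.split()
    for i in range(len(words)):
        if random.random() < mask_rate:
            masked = " ".join(words[:i] + [MASK] + words[i + 1:])
            words[i] = fill_mask(masked, top_k=1)[0]["token_str"]
    return " ".join(words)

# Hypothetical balancing loop: oversample the minority class with
# augmented copies until it roughly matches the majority class size.
def balance(minority_docs, majority_count):
    augmented = list(minority_docs)
    while len(augmented) < majority_count:
        augmented.append(bert_augment(random.choice(minority_docs)))
    return augmented
```

Documents generated this way would be added to the minority class before fine-tuning the BERT classifier, in contrast to plain oversampling, which only duplicates existing minority-class documents.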