On the Impact of Dataset Size: A Twitter Classification Case Study

Thi-Huyen Nguyen, Hoang H. Nguyen, Zahra Ahmadi, Tuan-Anh Hoang, Thanh-Nam Doan
DOI: 10.1145/3486622.3493960
Published in: Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology
Publication date: 2021-12-14
Citations: 0

Abstract

The recent advent and evolution of deep learning models and pre-trained embedding techniques have created a breakthrough in supervised learning. Typically, we expect that adding more labeled data improves the predictive performance of supervised models. On the other hand, collecting more labeled data is not an easy task due to several difficulties, such as manual labor costs, data privacy, and computational constraints. Hence, a comprehensive study of the relation between training set size and the classification performance of different methods can be essential when selecting a learning model for a specific task. However, the literature lacks such a thorough and systematic study. In this paper, we concentrate on this relationship in the context of short, noisy texts from Twitter. We design a systematic mechanism to comprehensively observe the performance improvement of supervised learning models as data size increases on three well-known Twitter tasks: sentiment analysis, informativeness detection, and information relevance. In addition, we study how much recent deep learning models improve over traditional machine learning approaches at various data sizes. Our extensive experiments show that (a) recent pre-trained models have overcome big data requirements, (b) a good choice of text representation has more impact than adding more data, and (c) adding more data is not always beneficial in supervised learning.
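The learning-curve protocol the abstract describes can be sketched as follows: train the same classifier on increasing fractions of the labeled data and record test performance at each size. This is a minimal illustration, not the paper's actual setup; the toy tweets, the TF-IDF + logistic regression pipeline, and the size grid are all hypothetical stand-ins for the paper's datasets and models.

```python
# Learning-curve sweep: fit one model per training-set size and compare scores.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Toy labeled tweets (placeholder for a real Twitter corpus).
train_texts = ["great game tonight", "love this so much", "what a win",
               "awful service again", "this is terrible", "worst day ever",
               "really happy now", "feeling sad today"] * 10
train_labels = [1, 1, 1, 0, 0, 0, 1, 0] * 10
test_texts = ["love the win", "terrible awful service"]
test_labels = [1, 0]

scores = {}
for frac in (0.25, 0.5, 1.0):  # increasing training-set sizes
    n = int(len(train_texts) * frac)
    vec = TfidfVectorizer()
    X = vec.fit_transform(train_texts[:n])
    clf = LogisticRegression().fit(X, train_labels[:n])
    pred = clf.predict(vec.transform(test_texts))
    scores[frac] = f1_score(test_labels, pred, average="macro")

print(scores)  # macro-F1 at each training-set fraction
```

Plotting the resulting scores against training-set size gives the learning curve; comparing curves across representations (e.g., TF-IDF vs. pre-trained embeddings) is what lets one ask whether a better representation beats more data.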