On the Impact of Dataset Size: A Twitter Classification Case Study

Thi-Huyen Nguyen, Hoang H. Nguyen, Zahra Ahmadi, Tuan-Anh Hoang, Thanh-Nam Doan
DOI: 10.1145/3486622.3493960
Published in: Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology
Publication date: 2021-12-14
Citations: 0

Abstract

The recent advent and evolution of deep learning models and pre-trained embedding techniques have created a breakthrough in supervised learning. Typically, we expect that adding more labeled data improves the predictive performance of supervised models. On the other hand, collecting more labeled data is not an easy task due to several difficulties, such as manual labor costs, data privacy, and computational constraints. Hence, a comprehensive study of the relation between training set size and the classification performance of different methods can be essential when selecting a learning model for a specific task. However, the literature lacks such a thorough and systematic study. In this paper, we concentrate on this relationship in the context of short, noisy texts from Twitter. We design a systematic mechanism to comprehensively observe the performance improvement of supervised learning models as data size increases on three well-known Twitter tasks: sentiment analysis, informativeness detection, and information relevance. In addition, we study how much recent deep learning models improve over traditional machine learning approaches at various data sizes. Our extensive experiments show that (a) recent pre-trained models have overcome big data requirements, (b) a good choice of text representation has more impact than adding more data, and (c) adding more data is not always beneficial in supervised learning.
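The learning-curve protocol the abstract describes can be sketched as follows: train the same classifier on increasing fractions of the labeled data and record test performance at each size. This is a minimal illustration, not the paper's actual setup; the toy tweets, the TF-IDF + logistic regression pipeline, and the size grid are all hypothetical stand-ins for the paper's datasets and models.

```python
# Learning-curve sweep: fit one model per training-set size and compare scores.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Toy labeled tweets (placeholder for a real Twitter corpus).
train_texts = ["great game tonight", "love this so much", "what a win",
               "awful service again", "this is terrible", "worst day ever",
               "really happy now", "feeling sad today"] * 10
train_labels = [1, 1, 1, 0, 0, 0, 1, 0] * 10
test_texts = ["love the win", "terrible awful service"]
test_labels = [1, 0]

scores = {}
for frac in (0.25, 0.5, 1.0):  # increasing training-set sizes
    n = int(len(train_texts) * frac)
    vec = TfidfVectorizer()
    X = vec.fit_transform(train_texts[:n])
    clf = LogisticRegression().fit(X, train_labels[:n])
    pred = clf.predict(vec.transform(test_texts))
    scores[frac] = f1_score(test_labels, pred, average="macro")

print(scores)  # macro-F1 at each training-set fraction
```

Plotting the resulting scores against training-set size gives the learning curve; comparing curves across representations (e.g., TF-IDF vs. pre-trained embeddings) is what lets one ask whether a better representation beats more data.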