Surveying public opinion using label prediction on social media data

Marija Stanojevic, Jumanah Alshehri, Z. Obradovic
Published in: 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), 2019-08-01
DOI: 10.1145/3341161.3342861 (https://doi.org/10.1145/3341161.3342861)
Citations: 7

Abstract

This study proposes a procedure for surveying public opinion from large, domain-specific social media text, with the aim of minimizing the difficulties associated with modeling public behavior. Strategies for labeling posts relevant to a topic are discussed. A two-part framework is proposed in which semi-automatic labeling is applied to a small subset of posts, referred to below as the “seed”. This seed serves as the basis for semi-supervised labeling of the rest of the data. The hypothesis is that the proposed method achieves better labeling performance than existing classification models when only a small amount of labeled data is available. The seed is labeled using posts of users with a known and consistent view on the topic. A semi-supervised multi-class prediction model then labels the remaining data iteratively: in each iteration, it adds context-label pairs to the training set if their softmax-based label probabilities are above the threshold. The proposed method is evaluated on four datasets against three popular text modeling algorithms (n-grams + TF-IDF, fastText, VDCNN) for different sizes of labeled seeds (5,000 and 50,000 posts) and for several label-prediction significance thresholds. The proposed semi-supervised method outperformed the alternative algorithms by capturing additional contexts from the unlabeled data. Accuracy increased by 3-10% when a larger fraction of the data was used as the seed. For the smaller seed, a lower label probability threshold was clearly the better choice, while for the larger seed no predominant threshold was observed. The proposed framework built on the fastText library for efficient text classification and representation learning achieved the best results for the smaller seed, while VDCNN wrapped in the proposed framework achieved the best results for the bigger seed. Performance was negatively affected by the number of classes.
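The iterative labeling step described in the abstract can be sketched as a self-training loop. This is only an illustration under stated assumptions, not the authors' implementation: `TinyBagOfWords` is a hypothetical toy naive Bayes-style classifier standing in for the fastText or VDCNN models named in the abstract, and `self_train`, `threshold`, and `max_iters` are illustrative names and defaults that do not come from the paper.

```python
import math
from collections import Counter


def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class TinyBagOfWords:
    """Hypothetical toy classifier (assumption): naive Bayes over word counts."""

    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        self.class_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for c in self.classes for w in self.word_counts[c]}
        return self

    def predict_proba(self, text):
        n_docs = sum(self.class_counts.values())
        scores = []
        for c in self.classes:
            # Add-one smoothing over the known vocabulary.
            denom = sum(self.word_counts[c].values()) + len(self.vocab)
            score = math.log(self.class_counts[c] / n_docs)
            for word in text.lower().split():
                if word in self.vocab:  # unseen words carry no evidence
                    score += math.log((self.word_counts[c][word] + 1) / denom)
            scores.append(score)
        return dict(zip(self.classes, softmax(scores)))


def self_train(seed_texts, seed_labels, unlabeled, threshold=0.9, max_iters=10):
    """Grow the training set with predictions whose probability exceeds threshold."""
    texts, labels = list(seed_texts), list(seed_labels)
    pool = list(unlabeled)
    model = TinyBagOfWords().fit(texts, labels)
    for _ in range(max_iters):
        remaining, added = [], 0
        for text in pool:
            probs = model.predict_proba(text)
            label, prob = max(probs.items(), key=lambda kv: kv[1])
            if prob > threshold:
                texts.append(text)
                labels.append(label)
                added += 1
            else:
                remaining.append(text)
        pool = remaining
        if added == 0:  # nothing confident enough: stop early
            break
        model = TinyBagOfWords().fit(texts, labels)  # retrain on the enlarged set
    return model, texts, labels, pool
```

In this sketch a lower `threshold` admits more pseudo-labeled posts per iteration at the cost of more label noise, which is the trade-off the abstract reports on for the smaller and larger seeds.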
Finally, the model was applied to characterize a biased dataset of opinions related to gun control/rights advocacy. The proposed semi-automatic seed labeling was used to label 8,448 Twitter posts of 171 advocates for gun control/rights. In this application, our approach performed better than existing models, achieving 96.5% accuracy and a 0.68 F1 score.
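The seed-labeling idea, in which posts inherit the stance of an author whose view is known and consistent, can be illustrated with a short sketch. The abstract does not say how such users are identified or how their posts are filtered, so everything here is an assumption: `label_seed` is a hypothetical helper, `known_stance` is a hypothetical, manually curated mapping from user to stance, and the example users and posts are invented for illustration.

```python
def label_seed(posts, known_stance):
    """Split posts into a labeled seed and an unlabeled pool.

    posts: iterable of (user, text) pairs.
    known_stance: dict mapping a user to that user's known stance (assumption:
        it contains only users whose view on the topic is consistent).
    """
    seed, unlabeled = [], []
    for user, text in posts:
        stance = known_stance.get(user)
        if stance is None:
            unlabeled.append(text)        # author's view unknown: leave for later
        else:
            seed.append((text, stance))   # post inherits the author's stance
    return seed, unlabeled


# Invented example data, for illustration only.
posts = [
    ("user_a", "we need background checks"),
    ("user_b", "the second amendment protects us"),
    ("user_c", "saw a news story about this today"),
]
known_stance = {"user_a": "control", "user_b": "rights"}
seed, unlabeled = label_seed(posts, known_stance)
print(len(seed), "seed posts;", len(unlabeled), "unlabeled posts")
```

The resulting `seed` pairs would then serve as the initial training set for the semi-supervised step, with `unlabeled` as the pool it labels iteratively.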