Large-Scale Demographic Inference of Social Media Users in a Low-Resource Scenario

International Conference on Web and Social Media. Pub Date: 2023-06-02. DOI: 10.1609/icwsm.v17i1.22165

Karim Lasri, Manuel Tonneau, Haaya Naushan, Niyati Malhotra, I. Farouq, Victor Orozco-Olvera, S. Fraiberger



Abstract

Characterizing the demographics of social media users enables a range of applications, from better targeting of policy interventions to deriving representative population estimates of social phenomena. Achieving high performance with supervised learning, however, can be challenging because labeled data is often scarce. Rule-based matching strategies, by contrast, provide well-grounded information but cover only a subset of users. It is therefore unclear which features and models are best suited to maximize coverage over a large set of users while maintaining high performance. In this paper, we develop a cost-effective strategy for large-scale demographic inference that relies on minimal labeling effort. We combine a name-matching strategy with graph-based methods to map the demographics of 1.8 million Nigerian Twitter users. Specifically, we compare a purely graph-based propagation model, Label Propagation (LP), with Graph Convolutional Networks (GCN), a graph model that also incorporates node features based on user content. We find that both models largely outperform supervised learning approaches that rely purely on user content and lack graph information. Notably, LP achieves performance comparable to the state-of-the-art GCN while providing greater interpretability at a lower computing cost. Moreover, performance does not significantly improve with the addition of user-specific features, such as textual representations of user tweets and user geolocation. Leveraging our data collection effort, we describe the demographic composition of Nigerian Twitter, finding that it is a highly non-uniform sample of the general Nigerian population.
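To make the graph-based idea in the abstract concrete, the following is a minimal sketch of label propagation seeded by a small set of known labels, such as those a name-matching step might produce. It is an illustration under stated assumptions, not the authors' implementation: the function name `label_propagation`, the dense NumPy adjacency matrix, the fixed iteration count, and the six-node toy graph are all hypothetical choices made for this example.

```python
import numpy as np


def label_propagation(adj, seed_labels, n_classes, n_iter=50):
    """Propagate seed labels over an undirected graph (illustrative sketch).

    adj         : (n, n) symmetric adjacency matrix.
    seed_labels : length-n integer array; class index for seeded nodes, -1 otherwise.
    Returns a length-n array of predicted classes (-1 if no seed is reachable).
    """
    adj = np.asarray(adj, dtype=float)
    seed_labels = np.asarray(seed_labels)
    n = adj.shape[0]
    seeded = seed_labels >= 0

    # One-hot class scores for seeded nodes, zeros for everyone else.
    seed_scores = np.zeros((n, n_classes))
    seed_scores[seeded, seed_labels[seeded]] = 1.0

    # Row-normalise the adjacency so each node averages its neighbours' scores.
    degree = adj.sum(axis=1, keepdims=True)
    transition = np.divide(adj, degree, out=np.zeros_like(adj), where=degree > 0)

    scores = seed_scores.copy()
    for _ in range(n_iter):
        scores = transition @ scores
        # Clamp the seeds so known labels are never overwritten.
        scores[seeded] = seed_scores[seeded]

    predicted = scores.argmax(axis=1)
    # Nodes that received no score mass are left unlabeled.
    predicted[scores.sum(axis=1) == 0] = -1
    return predicted


# Hypothetical toy graph: two triangles (0-1-2 and 3-4-5) joined by the edge 2-3.
toy_edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
toy_adj = np.zeros((6, 6))
for u, v in toy_edges:
    toy_adj[u, v] = toy_adj[v, u] = 1.0

# Only nodes 0 and 5 carry seed labels; the rest are inferred from the graph.
toy_seeds = [0, -1, -1, -1, -1, 1]
toy_pred = label_propagation(toy_adj, toy_seeds, n_classes=2)
print(toy_pred)
```

In this toy setting each triangle inherits the label of its single seeded member, which is the kind of coverage gain the abstract attributes to graph-based methods. A graph with millions of users would need a sparse adjacency representation rather than the dense matrix assumed here.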

