Large-Scale Demographic Inference of Social Media Users in a Low-Resource Scenario

Karim Lasri, Manuel Tonneau, Haaya Naushan, Niyati Malhotra, I. Farouq, Victor Orozco-Olvera, S. Fraiberger
Venue: International Conference on Web and Social Media (ICWSM)
Publication date: 2023-06-02
DOI: 10.1609/icwsm.v17i1.22165 (https://doi.org/10.1609/icwsm.v17i1.22165)
Citations: 0

Abstract

Characterizing the demographics of social media users enables a range of applications, from better targeting of policy interventions to deriving representative population estimates of social phenomena. Achieving high performance with supervised learning, however, can be challenging because labeled data is often scarce. Rule-based matching strategies, by contrast, provide well-grounded information but cover only a fraction of users. It is therefore unclear which features and models are best suited to maximizing coverage over a large set of users while maintaining high performance. In this paper, we develop a cost-effective strategy for large-scale demographic inference that relies on minimal labeling effort. We combine a name-matching strategy with graph-based methods to map the demographics of 1.8 million Nigerian Twitter users. Specifically, we compare a purely graph-based propagation model, Label Propagation (LP), with Graph Convolutional Networks (GCN), a graph model that also incorporates node features based on user content. We find that both models substantially outperform supervised learning approaches that rely purely on user content and lack graph information. Notably, LP achieves performance comparable to the state-of-the-art GCN while providing greater interpretability at a lower computing cost. Moreover, performance does not significantly improve with the addition of user-specific features, such as textual representations of user tweets and user geolocation. Leveraging our data collection effort, we describe the demographic composition of Nigerian Twitter, finding that it is a highly non-uniform sample of the general Nigerian population.
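To make the Label Propagation idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes a tiny undirected toy graph of six users, with two seed users whose labels stand in for name-matched demographics; the function name `label_propagation`, the graph, and the seed assignments are all hypothetical. Seed labels are clamped, and every other user repeatedly takes the average of its neighbors' class scores.

```python
import numpy as np

def label_propagation(adj, seed_labels, n_classes, n_iter=50):
    """Propagate seed labels over a graph by repeated neighbor averaging.

    adj: (n, n) symmetric adjacency matrix.
    seed_labels: length-n int array; class index for seeds, -1 for unlabeled.
    Returns an (n, n_classes) array of class scores per node.
    """
    adj = np.asarray(adj, dtype=float)
    seed_labels = np.asarray(seed_labels)
    n = adj.shape[0]

    # Row-normalize so each node averages the scores of its neighbors.
    deg = adj.sum(axis=1, keepdims=True)
    deg[deg == 0] = 1.0  # avoid division by zero for isolated nodes
    trans = adj / deg

    # One-hot scores for seed nodes, zeros for unlabeled nodes.
    is_seed = seed_labels >= 0
    seeds = np.zeros((n, n_classes))
    seeds[is_seed, seed_labels[is_seed]] = 1.0

    scores = seeds.copy()
    for _ in range(n_iter):
        scores = trans @ scores
        scores[is_seed] = seeds[is_seed]  # clamp seeds to their known label
    return scores

# Hypothetical toy graph: two triangles (0-1-2 and 3-4-5) joined by edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
adj = np.zeros((6, 6))
for i, j in edges:
    adj[i, j] = adj[j, i] = 1.0

# Assumed seeds: user 0 has class 0, user 5 has class 1; the rest are unlabeled.
seed_labels = np.array([0, -1, -1, -1, -1, 1])

scores = label_propagation(adj, seed_labels, n_classes=2)
pred = scores.argmax(axis=1)
print(pred)
```

In this toy setting each unlabeled user ends up with the label of the seed in its own cluster, which illustrates why a propagation model can extend partial name-matching coverage using only the graph. A GCN would additionally take per-user feature vectors as input, which this sketch deliberately leaves out.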