Filling in the white space: Spatial interpolation with Gaussian processes and social media data

Current Research in Ecological and Social Psychology (impact factor 2.2)

Pub Date: 2023-01-01. DOI: 10.1016/j.cresp.2023.100159

Salvatore Giorgi, Johannes C. Eichstaedt, Daniel Preoţiuc-Pietro, Jacob R. Gardner, H. Andrew Schwartz, Lyle H. Ungar

Volume 5, Article 100159. https://www.sciencedirect.com/science/article/pii/S2666622723000722


Abstract

Full national coverage below the state level is difficult to attain through survey-based data collection. Even the largest survey-based data collections, such as the CDC's Behavioral Risk Factor Surveillance System or the Gallup-Healthways Well-being Index (both with more than 300,000 responses per year), only allow annual averages to be estimated for about 260 of the roughly 3,000 U.S. counties when a threshold of 300 responses per county is used. A relatively high threshold of 300 responses gives substantially higher convergent validity (that is, higher correlations with health variables) than lower thresholds, but it covers a reduced and biased sample of the population. We present principled methods to interpolate spatial estimates and show that including large-scale geotagged social media data can increase interpolation accuracy. In this work, we focus on Gallup-reported life satisfaction, a widely used measure of subjective well-being. We use Gaussian Processes (GP), a formal Bayesian model, to interpolate life satisfaction, and we optimally combine the interpolated values with estimates from low-count data. We interpolate over several spaces (geographic and socioeconomic) and extend these evaluations to the space created by variables encoding the language frequencies of approximately 6 million geotagged Twitter users. We find that Twitter language use can serve as a rough aggregate measure of socioeconomic and cultural similarity, and that it improves upon estimates derived from a wide variety of socioeconomic, demographic, and geographic similarity measures. We show that applying Gaussian Processes to the limited Gallup data allows us to generate estimates for a much larger number of counties while maintaining the same level of convergent validity with external criteria (i.e., N = 1,133 vs. 2,954 counties). This work suggests that the spatial coverage of psychological variables can be reliably extended through Bayesian techniques while maintaining out-of-sample prediction accuracy, and that Twitter language adds important information about cultural similarity over and above traditional socio-demographic and geographic similarity measures. Finally, to facilitate the adoption of these methods, we have also open-sourced an online tool that researchers can freely use to interpolate their data across geographies.
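The interpolation idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation or their open-sourced tool: the squared-exponential kernel, the constant prior mean, the scalar noise term, and the precision-weighted rule used to merge a GP prediction with a noisy low-count estimate are all assumptions chosen for illustration, and the function names (`rbf_kernel`, `gp_posterior`, `combine`) and the toy numbers are hypothetical.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=1.0, variance=1.0):
    """Squared-exponential kernel between the rows of a and the rows of b."""
    sq_dist = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return variance * np.exp(-0.5 * np.maximum(sq_dist, 0.0) / length_scale**2)


def gp_posterior(x_train, y_train, x_test, noise_var,
                 length_scale=1.0, variance=1.0):
    """GP regression posterior mean and variance at x_test.

    Assumes a constant prior mean (the training average) and a single
    scalar observation-noise variance shared by all training counties.
    """
    prior_mean = y_train.mean()
    k_tt = rbf_kernel(x_train, x_train, length_scale, variance)
    k_tt = k_tt + noise_var * np.eye(len(x_train))
    k_st = rbf_kernel(x_test, x_train, length_scale, variance)

    chol = np.linalg.cholesky(k_tt)
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train - prior_mean))
    mean = prior_mean + k_st @ alpha

    v = np.linalg.solve(chol, k_st.T)
    var = variance - np.sum(v**2, axis=0)
    return mean, np.maximum(var, 1e-12)


def combine(gp_mean, gp_var, obs_mean, obs_var):
    """Precision-weighted average of a GP prediction and a direct estimate.

    Assumption: both are treated as independent Gaussian estimates of the
    same quantity; this is one standard rule, not necessarily the paper's.
    """
    weight_gp = (1.0 / gp_var) / (1.0 / gp_var + 1.0 / obs_var)
    return weight_gp * gp_mean + (1.0 - weight_gp) * obs_mean


if __name__ == "__main__":
    # Toy, made-up data: one feature per county (e.g. a position on a line).
    x_known = np.array([[0.0], [1.0], [2.0], [4.0]])
    y_known = np.array([6.5, 7.0, 6.8, 7.2])  # well-sampled counties
    x_sparse = np.array([[3.0]])              # a low-count county
    mean, var = gp_posterior(x_known, y_known, x_sparse, noise_var=0.01)
    # Merge with a noisy survey average from the few available responses.
    merged = combine(mean, var, obs_mean=np.array([6.0]), obs_var=np.array([0.5]))
    print(mean, var, merged)
```

In this sketch each row of `x_train` stands for a county's position in whichever space is being interpolated over (geographic, socioeconomic, or language-based), so switching spaces only changes the feature vectors, not the procedure.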
