Addressing Selection Bias in Event Studies with General-Purpose Social Media Panels

Journal of Data and Information Quality (JDIQ) Pub Date: 2018-05-29 DOI: 10.1145/3185048

Han Zhang, Shawndra Hill, David M. Rothschild


Citations: 9

Abstract


Prior research has used Twitter data to study the impact of events. Conventionally, researchers use keyword-based samples of tweets to build a panel of Twitter users who mention event-related keywords during and after an event. However, keyword-based sampling is limited in the objectivity dimension of data and information quality. First, the technique suffers from selection bias, since users who discuss an event were already more likely to discuss event-related topics before it occurred. Second, there is no viable control group against which to compare a keyword-based sample of Twitter users. We propose an alternative sampling approach that constructs panels of users defined by their geolocation. Geolocated panels are exogenous to the keywords in users’ tweets, resulting in less selection bias than the keyword panel method. Geolocated panels also allow us to follow within-person changes over time and enable the creation of comparison groups. We compare different panels in two real-world settings: responses to mass shootings and to TV advertising. We first show the strength of the selection biases of keyword panels. Then, we empirically illustrate how geolocated panels reduce selection biases and allow meaningful comparison groups regarding the impact of the studied events. We are the first to provide a clear, empirical example of how a better panel selection design, based on an exogenous variable such as geography, both reduces selection bias compared to the current state of the art and increases the value of Twitter research for studying events. While we advocate for the use of geolocated panels, we also discuss their weaknesses and appropriate application scenarios carefully. This article also calls attention to how selection bias affects the objectivity of social media data.
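
The contrast between the two panel designs can be illustrated with a toy sketch. It is not the authors' procedure: the tweets are synthetic, and the names `Tweet`, `keyword_panel`, `geo_panel`, `mention_rate`, the regions "X" and "Y", and the keyword "gun" are all hypothetical choices made for illustration. The before/after comparison between an affected and an unaffected region is one simple way a comparison group might be used, assumed here rather than taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Tweet:
    user: str
    region: str  # hypothetical coarse geolocation of the user
    period: str  # "pre" or "post", relative to the event
    text: str


def keyword_panel(tweets, keyword):
    """Users who mention the keyword during or after the event."""
    return {t.user for t in tweets
            if t.period == "post" and keyword in t.text.lower()}


def geo_panel(tweets, region):
    """Users selected only by location, whatever they tweet about."""
    return {t.user for t in tweets if t.region == region}


def mention_rate(tweets, panel, period, keyword):
    """Share of the panel's tweets in a period that mention the keyword."""
    rows = [t for t in tweets if t.user in panel and t.period == period]
    if not rows:
        return 0.0
    return sum(keyword in t.text.lower() for t in rows) / len(rows)


# Synthetic data: region "X" is assumed affected by the event, "Y" is not.
TWEETS = [
    Tweet("a", "X", "pre", "gun laws again"),
    Tweet("a", "X", "post", "gun control now"),
    Tweet("b", "X", "pre", "coffee time"),
    Tweet("b", "X", "post", "thinking about gun policy"),
    Tweet("c", "X", "pre", "nice weather"),
    Tweet("c", "X", "post", "game night"),
    Tweet("d", "Y", "pre", "gun show this weekend"),
    Tweet("d", "Y", "post", "gun show was fun"),
    Tweet("e", "Y", "pre", "lunch"),
    Tweet("e", "Y", "post", "dinner"),
    Tweet("f", "Y", "pre", "traffic"),
    Tweet("f", "Y", "post", "more traffic"),
]

KEYWORD = "gun"
kw = keyword_panel(TWEETS, KEYWORD)
treated = geo_panel(TWEETS, "X")
control = geo_panel(TWEETS, "Y")

# Selection bias: the keyword panel already over-mentions the topic pre-event.
kw_pre = mention_rate(TWEETS, kw, "pre", KEYWORD)
geo_pre = mention_rate(TWEETS, treated, "pre", KEYWORD)

# Before/after change in the affected region minus the change in the
# comparison region (a simple difference-in-differences).
did = (
    (mention_rate(TWEETS, treated, "post", KEYWORD) - geo_pre)
    - (mention_rate(TWEETS, control, "post", KEYWORD)
       - mention_rate(TWEETS, control, "pre", KEYWORD))
)

print("keyword panel pre-event rate:", kw_pre)
print("geolocated panel pre-event rate:", geo_pre)
print("difference-in-differences:", did)
```

In this toy data the keyword panel selects users on the outcome itself, so its pre-event mention rate is higher than that of the location-defined panel, and it offers no natural group to compare against. The geolocated panels are fixed before looking at tweet content, which is what makes the before/after comparison across regions possible.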


Source journal

Journal of Data and Information Quality (JDIQ)

Self-citation rate

0.00%

Publication count