We Don't Know What We Don't Know: When and How the Use of Twitter's Public APIs Biases Scientific Inference

Rebekah Tromble, A. Storz, D. Stockmann
DOI: 10.2139/ssrn.3079927 (https://doi.org/10.2139/ssrn.3079927)
Journal: PSN: Computational Models (Quantitative) (Topic)
Publication date: 2017-11-29
Publication type: Journal Article
Citations: 37

Abstract

Though Twitter research has proliferated, no standards for data collection have crystallized. When using keyword queries, the most common data sources (the Search and Streaming APIs) rarely return the full population of tweets, and scholars do not know whether their data constitute a representative sample. This paper seeks to provide the most comprehensive look to date at the potential biases that may result. Employing data derived from four identical keyword queries to the Firehose (which provides the full population of tweets but is cost-prohibitive), Streaming, and Search APIs, we use Kendall's tau and logit regression analyses to understand the differences in the datasets, including what user and content characteristics make a tweet more or less likely to appear in sampled results. We find that there are indeed systematic differences that are likely to bias scholars' findings in almost all datasets we examine, and we recommend significant caution in future Twitter research.
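To make the rank-comparison idea concrete, here is a minimal sketch of Kendall's tau applied to a full dataset and a sample of it. The abstract does not say which quantities were ranked or which tau variant was used, so everything below is an assumption for illustration: the hashtag names and counts are invented, the function name `kendall_tau_a` is hypothetical, and the simple tau-a variant (no tie correction) is used rather than whatever the authors chose.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall's tau-a for two equal-length numeric sequences.

    Tied pairs count as neither concordant nor discordant
    (no tie correction, unlike tau-b).
    """
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    n = len(x)
    if n < 2:
        raise ValueError("need at least two observations")

    concordant = 0
    discordant = 0
    for i, j in combinations(range(n), 2):
        # Positive product: the pair is ordered the same way in both lists.
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1

    total_pairs = n * (n - 1) / 2
    return (concordant - discordant) / total_pairs


# Invented hashtag counts: a hypothetical full population vs. a sampled result.
full_counts = {"#a": 500, "#b": 300, "#c": 120, "#d": 40}
sample_counts = {"#a": 90, "#b": 95, "#c": 20, "#d": 5}

tags = sorted(full_counts)
tau = kendall_tau_a(
    [full_counts[t] for t in tags],
    [sample_counts[t] for t in tags],
)
print(f"Kendall's tau-a between full and sampled rankings: {tau:.3f}")
```

A tau close to 1 would suggest the sample preserves the ordering seen in the full data; lower values indicate that the sample reorders items, which is one way a sampled API result could mislead an analysis that depends on rankings.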