We Don't Know What We Don't Know: When and How the Use of Twitter's Public APIs Biases Scientific Inference
Authors: Rebekah Tromble, A. Storz, D. Stockmann
Journal: PSN: Computational Models (Quantitative) (Topic)
Publication type: Journal Article
Published: 2017-11-29
DOI: 10.2139/ssrn.3079927 (https://doi.org/10.2139/ssrn.3079927)
Citations: 37
Abstract
Though Twitter research has proliferated, no standards for data collection have crystallized. When using keyword queries, the most common data sources, the Search and Streaming APIs, rarely return the full population of tweets, and scholars do not know whether their data constitute a representative sample. This paper seeks to provide the most comprehensive look to date at the potential biases that may result. Employing data derived from four identical keyword queries to the Firehose (which provides the full population of tweets but is cost-prohibitive), Streaming, and Search APIs, we use Kendall's tau and logit regression analyses to understand the differences in the datasets, including what user and content characteristics make a tweet more or less likely to appear in sampled results. We find that there are indeed systematic differences that are likely to bias scholars' findings in almost all datasets we examine, and we recommend significant caution in future Twitter research.
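The abstract names Kendall's tau as one of the tools for comparing sampled datasets against the Firehose population. The sketch below illustrates that kind of rank comparison on toy data. It is not the authors' code: ranking hashtags by frequency, the tweet dictionary layout, and all function names are assumptions made for illustration, and it computes the simple tau-a variant, in which tied pairs count as neither concordant nor discordant.

```python
from collections import Counter
from itertools import combinations


def kendall_tau_a(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
        # product == 0 means a tie in xs or ys; tau-a counts it as neither
    total_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / total_pairs


def hashtag_counts(tweets):
    """Count how often each hashtag occurs (hypothetical tweet layout)."""
    counts = Counter()
    for tweet in tweets:
        counts.update(tweet["hashtags"])
    return counts


def compare_rankings(full_tweets, sampled_tweets):
    """Tau-a between hashtag frequencies in the full set and in a sample."""
    full = hashtag_counts(full_tweets)
    sample = hashtag_counts(sampled_tweets)
    tags = sorted(full)  # every hashtag seen in the full population
    # A Counter returns 0 for hashtags that are missing from the sample
    return kendall_tau_a([full[t] for t in tags], [sample[t] for t in tags])


# Toy stand-in for a full (Firehose-style) collection; not real data
FULL = [
    {"id": 1, "hashtags": ["a", "b"]},
    {"id": 2, "hashtags": ["a"]},
    {"id": 3, "hashtags": ["a", "c"]},
    {"id": 4, "hashtags": ["b"]},
]
# Toy stand-in for what a sampled API might return (tweet 3 is missing)
SAMPLE = [t for t in FULL if t["id"] in {1, 2, 4}]

print(round(compare_rankings(FULL, SAMPLE), 3))
```

A value near 1 means the sample preserves the full dataset's ordering, while lower values mean the sample reorders or flattens it. For real work, `scipy.stats.kendalltau` provides a tie-adjusted variant with a significance test.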