On the endogenesis of Twitter's Spritzer and Gardenhose sample streams
Dennis Kergl, R. Roedler, Sebastian Seeber
2014 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2014), 2014-08-17
DOI: 10.1109/ASONAM.2014.6921610 (https://doi.org/10.1109/ASONAM.2014.6921610)
Many recent publications deal with trend analysis, event detection, or opinion mining on social media data. Twitter, as the most important microblogging service, is often the focus of these works because it offers free access to large volumes of data. This free access, on which many publications rely, consists of a random subset of the complete public status stream. Publications rely in particular on the uniform distribution of tweets in this sample stream, and so far one has had to trust Twitter's statement that the sample data is indeed uniformly distributed. In our research on the technical properties of Twitter's streaming data, we found evidence that reveals the method Twitter uses to decide which tweets appear in the random sample streams. A deeper insight into this process suggests possible reasons why Twitter chose the presented sampling method. For this purpose, we provide an overview of how Twitter's unique tweet IDs are generated and explain the regularities of each part of a tweet ID. This also yields some information about Twitter's tweet ID generating infrastructure and about the kind of knowledge that can be derived from small features such as the tweet ID.
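To make the notion of the "parts of a tweet ID" concrete, the following is a minimal sketch of how a tweet ID could be split into its fields. It assumes the publicly documented Snowflake layout (41-bit millisecond timestamp, 5-bit datacenter ID, 5-bit worker ID, 12-bit sequence number); the abstract itself does not spell out this layout, and the function name `split_tweet_id` is purely illustrative. The sketch does not reproduce the sampling rule the paper identifies.

```python
from datetime import datetime, timezone

# Assumption: Snowflake epoch in milliseconds (2010-11-04 01:42:54.657 UTC).
TWITTER_EPOCH_MS = 1288834974657


def split_tweet_id(tweet_id: int) -> dict:
    """Split a Snowflake-style tweet ID into its assumed bit fields."""
    # The upper bits hold milliseconds elapsed since the Snowflake epoch.
    timestamp_ms = (tweet_id >> 22) + TWITTER_EPOCH_MS
    return {
        "timestamp_ms": timestamp_ms,
        "created_at": datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc),
        # Next 5 bits: datacenter ID; following 5 bits: worker ID.
        "datacenter_id": (tweet_id >> 17) & 0x1F,
        "worker_id": (tweet_id >> 12) & 0x1F,
        # Lowest 12 bits: per-millisecond sequence counter.
        "sequence": tweet_id & 0xFFF,
    }
```

Calling `split_tweet_id` on IDs collected from a sample stream and tabulating each field is one way to look for the kind of regularities the abstract refers to.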