Lessons Learned: A Case Study in Creating a Data Pipeline using Twitter’s API

2020 Systems and Information Engineering Design Symposium (SIEDS) | Pub Date: 2020-04-01 | DOI: 10.1109/SIEDS49339.2020.9106584

Jason Tiezzi, Rice Tyler, Suchetha Sharma


Citations: 3

Abstract

With over 300 million users, including frequent postings by elites across the political and entertainment fields, Twitter has become a rich field for mining and analyzing data. Despite its prominence within social science research, relatively little attention has been paid to the process of data acquisition. To that end, our research uses a case study to illustrate the process of acquiring and storing tweets. To construct our data pipeline, we first applied for and created Twitter developer accounts and used the Tweepy library in Python to interact with Twitter’s API. We created a program that uses a producer-consumer multithreading model to request tweets from the API, then cleans the data and pushes it to a MySQL database with four tables: one for tweets, one for user information, one for retweets, and one for special entities (e.g., hashtags). With our pipeline operational, we explore how candidate gender affects Twitter discourse in the 2020 Democratic presidential primary. Specifically, we use unsupervised text analysis methods to examine differences in word frequencies, sentiment, and emotional dimensions. We find that gender is central to the discourse surrounding female candidates, but peripheral for male candidates. The discourse surrounding female candidates in our dataset is also more joyful and positive. Finally, with our case study concluded, we offer lessons to future researchers who wish to acquire and utilize Twitter data for social science research.
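The producer-consumer design described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the abstract gives no schema or field names, so the table columns, the `SAMPLE_TWEETS` stub, and the function names below are all hypothetical. To keep the example runnable without credentials or a database server, a hard-coded list stands in for the Tweepy API calls and Python's built-in `sqlite3` stands in for MySQL.

```python
import queue
import re
import sqlite3
import threading

# Hypothetical stand-in for tweets returned by the API (field names are assumptions).
SAMPLE_TWEETS = [
    {"id": 1, "user": {"id": 10, "name": "alice"},
     "text": "  Debate   night! #Election2020 ", "retweet_of": None},
    {"id": 2, "user": {"id": 11, "name": "bob"},
     "text": "RT Debate night! #Election2020", "retweet_of": 1},
]

SENTINEL = None  # placed on the queue to tell the consumer the producer is done

# Four tables mirroring the abstract: tweets, users, retweets, special entities.
# Column names are illustrative only.
SCHEMA = """
CREATE TABLE tweets   (tweet_id INTEGER PRIMARY KEY, user_id INTEGER, text TEXT);
CREATE TABLE users    (user_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE retweets (tweet_id INTEGER PRIMARY KEY, original_tweet_id INTEGER);
CREATE TABLE entities (tweet_id INTEGER, kind TEXT, value TEXT);
"""


def producer(tweets, q):
    """Fetch tweets (here: iterate over a stub list) and enqueue them."""
    for tweet in tweets:
        q.put(tweet)  # blocks if the queue is full, throttling the producer
    q.put(SENTINEL)


def consumer(q, conn):
    """Dequeue tweets, clean the text, and write rows to the four tables."""
    while True:
        tweet = q.get()
        if tweet is SENTINEL:
            break
        text = " ".join(tweet["text"].split())  # collapse stray whitespace
        user = tweet["user"]
        # SQLite syntax; the MySQL equivalent would be INSERT IGNORE.
        conn.execute("INSERT OR IGNORE INTO users VALUES (?, ?)",
                     (user["id"], user["name"]))
        conn.execute("INSERT INTO tweets VALUES (?, ?, ?)",
                     (tweet["id"], user["id"], text))
        if tweet["retweet_of"] is not None:
            conn.execute("INSERT INTO retweets VALUES (?, ?)",
                         (tweet["id"], tweet["retweet_of"]))
        for tag in re.findall(r"#(\w+)", text):
            conn.execute("INSERT INTO entities VALUES (?, 'hashtag', ?)",
                         (tweet["id"], tag))
    conn.commit()


def run_pipeline(tweets):
    """Run one producer and one consumer thread; return the database connection."""
    # check_same_thread=False lets the consumer thread write and the caller read;
    # access is serialized because the caller only reads after join().
    conn = sqlite3.connect(":memory:", check_same_thread=False)
    conn.executescript(SCHEMA)
    q = queue.Queue(maxsize=100)
    threads = [
        threading.Thread(target=producer, args=(tweets, q)),
        threading.Thread(target=consumer, args=(q, conn)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return conn
```

Calling `run_pipeline(SAMPLE_TWEETS)` returns a connection whose four tables can then be queried. The bounded queue is one plausible reason to choose this model: the thread requesting tweets and the thread cleaning and storing them run independently, so a slow database write does not stall API requests until the queue fills.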

