Lessons Learned: A Case Study in Creating a Data Pipeline using Twitter’s API

Jason Tiezzi, Rice Tyler, Suchetha Sharma
Published in: 2020 Systems and Information Engineering Design Symposium (SIEDS). Publication date: 2020-04-01. DOI: 10.1109/SIEDS49339.2020.9106584 (https://doi.org/10.1109/SIEDS49339.2020.9106584)
Citations: 3

Abstract

With over 300 million users, including frequent postings by elites across the political and entertainment fields, Twitter has become a rich field for mining and analyzing data. Despite its prominence within social science research, relatively little attention has been paid to the process of data acquisition. To that end, our research uses a case study to illustrate the process of acquiring and storing tweets. To construct our data pipeline, we first applied for and created Twitter developer accounts and used the Tweepy library in Python to interact with Twitter’s API. We created a program that uses a producer-consumer multithreading model to request tweets from the API, then cleans the data and pushes it to a MySQL database with four tables: one for tweets, one for user information, one for retweets, and one for special entities (e.g., hashtags). With our pipeline operational, we explore how candidate gender affects Twitter discourse in the 2020 Democratic presidential primary. Specifically, we use unsupervised text analysis methods to examine differences in word frequencies, sentiment, and emotional dimensions. We find that gender is central to the discourse surrounding female candidates, but peripheral for male candidates. The discourse surrounding female candidates in our dataset is also more joyful and positive. Finally, with our case study concluded, we offer lessons to future researchers who wish to acquire and utilize Twitter data for social science research.
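
The producer-consumer design described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the abstract does not give their schema, field names, or cleaning rules, so everything below is an assumption made for illustration. In particular, the hard-coded `SAMPLE_TWEETS` list stands in for responses fetched through Tweepy, the standard-library `sqlite3` module stands in for MySQL so that the example runs without a server or credentials, and the table and column names are hypothetical.

```python
import queue
import sqlite3
import threading

SENTINEL = None  # placed on the queue to tell the consumer the producer is done

# Hypothetical stand-in for API responses; the field names are illustrative only.
SAMPLE_TWEETS = [
    {"id": 1, "user": {"id": 10, "name": "alice"},
     "text": "  Debate night #Election2020 ",
     "retweet_of": None, "hashtags": ["Election2020"]},
    {"id": 2, "user": {"id": 11, "name": "bob"},
     "text": "RT: Debate night #Election2020",
     "retweet_of": 1, "hashtags": ["Election2020"]},
]


def create_schema(conn):
    """Create four tables mirroring the layout named in the abstract (assumed columns)."""
    conn.executescript("""
        CREATE TABLE users    (user_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE tweets   (tweet_id INTEGER PRIMARY KEY, user_id INTEGER, text TEXT);
        CREATE TABLE retweets (tweet_id INTEGER PRIMARY KEY, original_tweet_id INTEGER);
        CREATE TABLE entities (tweet_id INTEGER, kind TEXT, value TEXT);
    """)


def producer(source, buffer):
    """Push raw tweets onto the shared queue; a real producer would call the API here."""
    for tweet in source:
        buffer.put(tweet)
    buffer.put(SENTINEL)


def consumer(buffer, conn):
    """Pull tweets off the queue, clean them, and write one row per table as needed."""
    while True:
        tweet = buffer.get()
        if tweet is SENTINEL:
            break
        text = " ".join(tweet["text"].split())  # collapse stray whitespace
        user = tweet["user"]
        conn.execute("INSERT OR IGNORE INTO users VALUES (?, ?)",
                     (user["id"], user["name"]))
        conn.execute("INSERT OR IGNORE INTO tweets VALUES (?, ?, ?)",
                     (tweet["id"], user["id"], text))
        if tweet["retweet_of"] is not None:
            conn.execute("INSERT OR IGNORE INTO retweets VALUES (?, ?)",
                         (tweet["id"], tweet["retweet_of"]))
        for tag in tweet["hashtags"]:
            conn.execute("INSERT INTO entities VALUES (?, 'hashtag', ?)",
                         (tweet["id"], tag.lower()))
    conn.commit()


def run_pipeline(source):
    """Run one producer and one consumer thread and return the populated database."""
    # check_same_thread=False lets the consumer thread use a connection
    # that was opened in the main thread.
    conn = sqlite3.connect(":memory:", check_same_thread=False)
    create_schema(conn)
    buffer = queue.Queue(maxsize=100)  # bounded, so a fast producer blocks instead of growing memory
    workers = [
        threading.Thread(target=producer, args=(source, buffer)),
        threading.Thread(target=consumer, args=(buffer, conn)),
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return conn
```

Calling `run_pipeline(SAMPLE_TWEETS)` returns a connection whose four tables can be queried with ordinary SQL. The bounded queue is the main design point: it decouples the rate at which tweets arrive from the rate at which they are cleaned and stored, which matters when the upstream source is rate limited or bursty.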