WordPPR: A Researcher-Driven Computational Keyword Selection Method for Text Data Retrieval from Digital Media

IF 6.3 | Zone 1, Literature | JCR Q1, Communication
Yini Zhang, Fan Chen, Jiyoun Suk, Zhiying Yue
Journal: Communication Methods and Measures
Published: 2023-11-14
DOI: 10.1080/19312458.2023.2278177
Citations: 0

Abstract

Despite the increasing use of digital media data in communication research, a central challenge persists: retrieving data with maximal accuracy and coverage. Our investigation of keyword-based data collection practices in extant communication research reveals a one-step process, whereas our cross-disciplinary literature review suggests an iterative query expansion process guided by human knowledge and computer intelligence. Hence, we introduce the WordPPR method for keyword selection and text data retrieval, which entails four steps: 1) collecting an initial dataset using core/seed keyword(s); 2) constructing a word graph based on the dataset; 3) applying the Personalized PageRank (PPR) algorithm to rank words in proximity to the seed keyword(s) and selecting new keywords that optimize retrieval precision and recall; 4) repeating steps 1–3 to determine if additional data collection is needed. Without requiring corpus-wide sampling/analysis or extensive manual annotation, this method is well suited for data collection from large-scale digital media corpora. Our simulation studies demonstrate its robustness against parameter choice and its improvement upon other methods in suggesting additional keywords. Its application in Twitter data retrieval is also provided. By advancing a more systematic approach to text data retrieval, this study contributes to improving digital media data retrieval practices in communication research and beyond.

Acknowledgement
We thank our reviewers, the editors, Dr. Karl Rohe, Dr. Nojin Kwak, and Dr. Dhavan Shah for their helpful feedback. We also thank Rui Wang, Dongdong Yang, and Xinxia Dong for assistance with the journal article coding.

Disclosure statement
No potential conflict of interest was reported by the author(s).

Data availability statement
The method and application code files as well as the supplementary materials are available at https://osf.io/pcybz/.

Supplementary material
Supplemental data for this article can be accessed online at https://doi.org/10.1080/19312458.2023.2278177.

Notes on contributors
Yini Zhang (Ph.D., University of Wisconsin–Madison) is an assistant professor in the Department of Communication at the University at Buffalo, State University of New York. She studies social media, media ecosystem, and political communication, using computational methods.

Fan Chen (Ph.D., University of Wisconsin–Madison) is a Data Scientist at Google. He studies and develops statistical methods for social media, genomics, and advertisement data. The bulk of this work was completed while he was a Ph.D. student at the University of Wisconsin–Madison.

Jiyoun Suk (Ph.D., University of Wisconsin–Madison) is an assistant professor in the Department of Communication at the University of Connecticut. She studies the role of networked communication in shaping social trust, activism, and polarization, using computational methods.

Zhiying Yue (Ph.D., University at Buffalo) is a postdoctoral researcher at the Digital Wellness Lab, Boston Children’s Hospital, and Harvard Medical School. Her research interests generally focus on individuals’ social media use and psychological well-being.
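
As an illustration of steps 2 and 3 described in the abstract, the following is a minimal Python sketch, not the authors' implementation (their code is available at https://osf.io/pcybz/). It assumes whitespace tokenization, a document-level co-occurrence word graph, a teleport probability of 0.15, and ranking by raw PPR score. The function names `build_word_graph`, `personalized_pagerank`, and `suggest_keywords` are hypothetical. The precision/recall-based keyword selection in step 3 and the repeated collection in step 4 are left to the researcher.

```python
from collections import defaultdict
from itertools import combinations


def build_word_graph(docs):
    """Undirected co-occurrence graph (assumed design).

    Edge weight = number of documents in which both words appear.
    """
    graph = defaultdict(lambda: defaultdict(float))
    for doc in docs:
        words = sorted(set(doc.lower().split()))
        for u, v in combinations(words, 2):
            graph[u][v] += 1.0
            graph[v][u] += 1.0
    return graph


def personalized_pagerank(graph, seeds, alpha=0.15, iters=100):
    """Personalized PageRank by power iteration.

    alpha is the probability of teleporting back to the seed words;
    with probability 1 - alpha the walk follows a weighted edge.
    """
    nodes = list(graph)
    seeds = [s for s in seeds if s in graph]
    if not seeds:
        raise ValueError("no seed keyword appears in the word graph")
    restart = {w: (1.0 / len(seeds) if w in seeds else 0.0) for w in nodes}
    rank = dict(restart)
    for _ in range(iters):
        nxt = {w: alpha * restart[w] for w in nodes}
        for u in nodes:
            total = sum(graph[u].values())  # > 0: every node has an edge
            for v, weight in graph[u].items():
                nxt[v] += (1.0 - alpha) * rank[u] * weight / total
        rank = nxt
    return rank


def suggest_keywords(docs, seeds, top_k=5):
    """Rank non-seed words by their PPR score relative to the seeds."""
    graph = build_word_graph(docs)
    rank = personalized_pagerank(graph, seeds)
    candidates = [(w, r) for w, r in rank.items() if w not in seeds]
    return sorted(candidates, key=lambda pair: -pair[1])[:top_k]


if __name__ == "__main__":
    # Toy corpus standing in for an initial seed-keyword collection (step 1).
    toy_docs = [
        "vaccine mandate protest downtown",
        "vaccine booster appointment clinic",
        "booster clinic hours",
    ]
    for word, score in suggest_keywords(toy_docs, ["vaccine"], top_k=3):
        print(f"{word}\t{score:.4f}")
```

In this sketch, words that never co-occur (directly or through a chain of documents) with a seed keyword receive a score of zero, which is the sense in which PPR ranks words "in proximity to" the seeds. A researcher would then review the top-ranked candidates before deciding which to add to the query.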
Source journal
CiteScore: 21.10
Self-citation rate: 1.80%
Articles published: 9
Journal description: Communication Methods and Measures aims to bring developments in both qualitative and quantitative research methodologies to the attention of communication scholars, and serves as a platform for researchers across the field to discuss and disseminate methodological tools and approaches. The journal also seeks to improve research design and analysis practices by offering suggestions for improvement, and to introduce new methods of measurement that are valuable to communication scientists or that enhance existing methods. It encourages submissions that focus on methods for enhancing research design and theory testing, employing both quantitative and qualitative approaches, and is open to articles that explore the epistemological aspects relevant to communication research methodologies. It welcomes well-written manuscripts that demonstrate the use of methods and articles that highlight the advantages of lesser-known or newer methods over those traditionally used in communication.