Generative artificial intelligence and machine learning methods to screen social media content.

IF 3.5; Zone 4 (Computer Science); JCR Q2 (Computer Science, Artificial Intelligence)

PeerJ Computer Science. Pub Date: 2025-03-14. eCollection Date: 2025-01-01. DOI: 10.7717/peerj-cs.2710

Kellen Sharp, Rachel R Ouellette, Rujula Singh Rajendra Singh, Elise E DeVito, Neil Kamdar, Amanda de la Noval, Dhiraj Murthy, Grace Kong


Citations: 0

Abstract

Background: Social media research is confronted by the expansive and constantly evolving nature of social media data. Hashtags and keywords are frequently used to identify content related to a specific topic, but these search strategies often result in large numbers of irrelevant results. Therefore, methods are needed to quickly screen social media content based on a specific research question. The primary objective of this article is to present generative artificial intelligence (AI; e.g., ChatGPT) and machine learning methods to screen content from social media platforms. As a proof of concept, we apply these methods to identify TikTok content related to e-cigarette use during pregnancy.

Methods: We searched TikTok for pregnancy and vaping content using 70 hashtag pairs related to "pregnancy" and "vaping" (e.g., #pregnancytok and #ecigarette) to obtain 11,673 distinct posts. We extracted post videos, descriptions, and metadata using Zeeschuimer and the PykTok library. To enhance textual analysis, we employed automatic speech recognition via the Whisper system to transcribe verbal content from each video. Next, we used the OpenCV library to extract frames from the videos, followed by object and text detection analysis using Oracle Cloud Vision. Finally, we merged all text data to create a consolidated dataset and entered this dataset into ChatGPT-4 to determine which posts are related to vaping and pregnancy. To refine the ChatGPT prompt used to screen for content, a human coder cross-checked ChatGPT-4's outputs for 10 out of every 100 metadata entries, with errors used to inform the final prompt. The final prompt was evaluated through human review: a reviewer checked posts for "pregnancy" and "vape" content, and these determinations were compared with those made by ChatGPT.
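The search and merge steps can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the hashtag lists, the field names (`description`, `transcript`, `frame_text`, `objects`), and the helpers `hashtag_pairs` and `merge_text` are all hypothetical, and the transcription and vision outputs are assumed to be already available as plain strings.

```python
from itertools import product

# Hypothetical seed lists; the study's actual 70 pairs are not listed in the abstract.
PREGNANCY_TAGS = ["#pregnancytok", "#pregnant"]
VAPING_TAGS = ["#ecigarette", "#vape", "#vaping"]


def hashtag_pairs(group_a, group_b):
    """Return every (a, b) combination of one hashtag from each group."""
    return list(product(group_a, group_b))


def merge_text(post):
    """Join the text sources for one post into a single labeled string.

    Missing or empty fields are skipped so the result stays compact.
    """
    fields = ["description", "transcript", "frame_text", "objects"]
    parts = []
    for name in fields:
        value = (post.get(name) or "").strip()
        if value:
            parts.append(f"{name}: {value}")
    return "\n".join(parts)


pairs = hashtag_pairs(PREGNANCY_TAGS, VAPING_TAGS)

# Hypothetical post record with one empty field (e.g. no speech detected).
example_post = {
    "description": "day in my life #pregnancytok",
    "transcript": "",
    "frame_text": "32 weeks",
    "objects": "person, bottle",
}
merged = merge_text(example_post)
```

Labeling each field in the merged string is one possible design choice; it lets a downstream prompt distinguish spoken content from on-screen text or detected objects.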

Results: ChatGPT-4 classified 44.86% of the videos as exclusively related to pregnancy, 36.91% to vaping, and 8.91% as containing both topics. A human reviewer confirmed vaping and pregnancy content in 45.38% of the TikTok posts identified by ChatGPT as containing relevant content. Human review of 10% of the posts screened out by ChatGPT found a 99.06% agreement rate for excluded posts.
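The two review figures can be read as simple proportions over human-checked posts. The sketch below shows one way such proportions could be computed; the function name `agreement_rate` and the toy labels are hypothetical, and the values are invented for illustration rather than taken from the study.

```python
def agreement_rate(model_labels, human_labels):
    """Fraction of posts where the human label matches the model label."""
    if len(model_labels) != len(human_labels):
        raise ValueError("label lists must have the same length")
    if not model_labels:
        raise ValueError("no posts to compare")
    matches = sum(m == h for m, h in zip(model_labels, human_labels))
    return matches / len(model_labels)


# Toy data (not from the study): posts the model flagged as relevant,
# paired with the human reviewer's decision for each.
flagged_model = [True, True, True, True]
flagged_human = [True, False, True, False]

# Toy data (not from the study): a sample of posts the model screened out.
excluded_model = [False, False, False, False, False]
excluded_human = [False, False, False, False, True]

# Share of flagged posts that the human reviewer confirmed.
confirmed_share = agreement_rate(flagged_model, flagged_human)

# Agreement between model and human on the excluded sample.
excluded_agreement = agreement_rate(excluded_model, excluded_human)
```

Neither proportion alone is sensitivity in the usual sense (true positives over all truly relevant posts), which would also need an estimate of how many relevant posts remain in the full excluded set.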

Conclusions: ChatGPT has mixed capacity to screen social media content that has been converted into text data using machine learning techniques such as object detection. In the current case example, ChatGPT's sensitivity was lower than that of a human coder, but it demonstrated power for screening out irrelevant content and can be used as an initial pass at screening content. Future studies should explore ways to enhance ChatGPT's sensitivity.




Source journal: PeerJ Computer Science (Computer Science: General Computer Science)

CiteScore: 6.10

Self-citation rate: 5.30%

Articles published: 332

Review time: 10 weeks

Journal description: PeerJ Computer Science is the new open access journal covering all subject areas in computer science, with the backing of a prestigious advisory board and more than 300 academic editors.