Categorizing E-cigarette-related tweets using BERT topic modeling

D. Murthy, S. Keshari, S. Arora, Q. Yang, A. Loukas, S.J. Schwartz, M.B. Harrell, E.T. Hébert, A.V. Wilkinson
Emerging trends in drugs, addictions, and health, volume 4, Article 100160
Publication date: 2024-10-05
DOI: 10.1016/j.etdah.2024.100160
https://www.sciencedirect.com/science/article/pii/S2667118224000199

Abstract

Background

Social media platforms are critical channels for promoting e-cigarettes, particularly among youth, making analysis of their vast and diverse content essential for public health interventions. Prevalence rates of e-cigarette use are high, and evidence suggests that social media are popular forums that promote e-cigarette use through direct and indirect marketing techniques. The volume and diverse nature of e-cigarette-related information on social media are challenging and may obfuscate public health prevention messaging. Traditional hand-coding methods are labor-intensive and limit scalability. In contrast, unsupervised machine learning approaches, such as topic modeling, allow for efficient analysis of large datasets, uncovering patterns and trends that manual methods cannot achieve at scale. The present study focused on ascertaining the extent to which themes and topics in tweets related to e-cigarettes can be successfully rendered into useful homogeneous units using machine learning. A better understanding of current depictions and discussions around e-cigarette products and use on social media can inform public health counter-messaging and policy interventions.

Methods

We used topic modeling (BERTopic) to iteratively derive vape-related tweet clusters and calculate the importance of particular words to these groupings. We conducted a qualitative content analysis to study clustered tweets. We also sought to infer the geographic locations of e-cigarette conversations using automated geoparsing methods, which translate toponyms in textual data into geographic identifiers.
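The abstract does not say how word importance was calculated. BERTopic commonly scores words with a class-based TF-IDF (c-TF-IDF), so the following is a minimal standard-library sketch under that assumption, not the authors' code. The function names, the cluster labels, and the tokenized toy tweets are all hypothetical and serve only to illustrate the calculation.

```python
import math
from collections import Counter


def c_tf_idf(clusters):
    """Score each term per cluster with a class-based TF-IDF.

    clusters: dict mapping a cluster label to a list of tokenized documents.
    Returns a dict mapping each label to a dict of term -> score.
    """
    # Treat every cluster as one merged bag of words.
    term_counts = {
        label: Counter(token for doc in docs for token in doc)
        for label, docs in clusters.items()
    }
    # Frequency of each term across all clusters.
    total = Counter()
    for counts in term_counts.values():
        total.update(counts)
    # Average number of words per cluster.
    avg_words = sum(total.values()) / len(term_counts)

    scores = {}
    for label, counts in term_counts.items():
        n_words = sum(counts.values())
        scores[label] = {
            # Normalized term frequency in the cluster, damped for terms
            # that are frequent across all clusters.
            term: (freq / n_words) * math.log(1 + avg_words / total[term])
            for term, freq in counts.items()
        }
    return scores


def top_words(scores, label, k=3):
    """Return the k highest-scoring terms for a cluster (ties broken alphabetically)."""
    ranked = sorted(scores[label].items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


# Hypothetical toy clusters, invented for illustration only.
toy_clusters = {
    "flavors": [["mango", "disposable", "vape"], ["mango", "flavor", "vape"]],
    "quitting": [["quit", "vape", "today"], ["quit", "nicotine", "vape"]],
}
toy_scores = c_tf_idf(toy_clusters)
print(top_words(toy_scores, "flavors"))
print(top_words(toy_scores, "quitting"))
```

In this toy input, "vape" appears as often as "mango" within the first cluster but is shared by both clusters, so it receives a lower score. That is the property that makes such weights useful for labeling clusters.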

Results

We successfully identified >100,000 tweets in broad thematic categories in English and Spanish. Our correlation and inter-topic map analysis of the machine-derived topics, which examines the relationships between topics, indicated that most of the topics were unique (correlation value < 0.5) and did not overlap with each other. We identified six topics: Flavors and Disposable Vapes, Cannabis, Vape Shops and Refillable Vapes, Vape Culture, Anti-vaping and Quitting, and Spanish Tweets and Vaping Nicotine. Further analysis of these topics using qualitative methods identified themes within each topic. For example, Category 6 (Spanish Tweets and Vaping Nicotine) included four topics focused on the health risks of vaping, personal motivations for vaping, and the regulation of vaping products. Using geoparsing, which automatically detects location information, we found that the United States had the highest number of tweets related to vaping.
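The abstract does not state how the correlation between topics was computed. One plausible reading, assumed here, is a pairwise cosine similarity between topic word-weight vectors, with the 0.5 threshold separating distinct topics from overlapping ones. The sketch below uses only the standard library. The topic labels and weights are hypothetical values chosen for illustration and are not results from the study.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as term -> weight dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def overlapping_pairs(topic_vectors, threshold=0.5):
    """List topic pairs whose similarity reaches the threshold."""
    pairs = []
    for a, b in combinations(sorted(topic_vectors), 2):
        similarity = cosine(topic_vectors[a], topic_vectors[b])
        if similarity >= threshold:
            pairs.append((a, b, round(similarity, 3)))
    return pairs


# Hypothetical word weights for three made-up topics.
toy_topics = {
    "flavors": {"mango": 0.5, "disposable": 0.3, "vape": 0.2},
    "vape_shops": {"shop": 0.5, "refillable": 0.3, "vape": 0.2},
    "disposables": {"disposable": 0.5, "mango": 0.3, "vape": 0.2},
}
toy_overlaps = overlapping_pairs(toy_topics)
print(toy_overlaps)
```

Under this reading, a topic set counts as well separated when few or no pairs are returned. In the toy input, only the two topics that share most of their vocabulary exceed the threshold.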

Discussion/conclusion

Results underscore the possibility of leveraging BERTopic modeling to reduce large quantities of data to comprehensively describe and categorize the myriad e-cigarette-related messages to which social media users are exposed. This data reduction approach can be applied to various social media platforms to describe and categorize e-cigarette posts and thereby triangulate and validate findings. Thematic content analysis of the topics identified through this technique requires supervision and human input. Our approach provides a comprehensive understanding of the evolving e-cigarette discourse, informing public health counter-messaging and policy interventions. Moreover, findings support the need for regulation, such as reducing appealing flavors, and suggest that social media can be used effectively to support public health messaging (i.e., quitting messages).