AI-powered topic modeling: comparing LDA and BERTopic in analyzing opioid-related cardiovascular risks in women.

IF 2.8 | Zone 4, Medicine | Q2 MEDICINE, RESEARCH & EXPERIMENTAL
Experimental Biology and Medicine | Pub Date: 2025-02-28 | eCollection Date: 2025-01-01 | DOI: 10.3389/ebm.2025.10389
Li Ma, Ru Chen, Weigong Ge, Paul Rogers, Beverly Lyn-Cook, Huixiao Hong, Weida Tong, Ningning Wu, Wen Zou
Experimental Biology and Medicine, volume 250, article 10389. Publication type: Journal Article. Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11906279/pdf/
Citation count: 0

Abstract

Topic modeling is a core technique in natural language processing (NLP) that extracts latent themes from large text corpora. Traditional methods such as Latent Dirichlet Allocation (LDA) have been widely applied in text mining but are limited in capturing the semantic relationships within documents. BERTopic, introduced in 2022, leverages advances in deep learning and can capture the contextual relationships between words. In this work, we integrated artificial intelligence (AI) modules into LDA and BERTopic and compared the two methods comprehensively in an analysis of prescription opioid-related cardiovascular risks in women. Opioid use can increase the risk of cardiovascular problems in women, such as arrhythmia and hypotension. A total of 1,837 abstracts were retrieved from PubMed as of April 2024 using three Medical Subject Headings (MeSH) terms: "opioid," "cardiovascular," and "women." The Machine Learning for Language Toolkit (MALLET) was used to implement LDA, and BioBERT was used for document embedding in BERTopic. Eighteen topics were selected as optimal for MALLET and 23 for BERTopic. ChatGPT-4-Turbo was integrated to interpret and compare the results. The short topic descriptions generated by ChatGPT for LDA and BERTopic were highly correlated, and expert manual review of the abstracts, grouped by their predominant topics, found the accuracies of the two methods to be similar. The t-SNE (t-distributed Stochastic Neighbor Embedding) plots showed that the clusters produced by BERTopic were more compact and better separated, indicating improved coherence and distinctiveness among the topics. Our findings indicate that AI algorithms can augment both traditional and contemporary topic modeling techniques.
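The expert review described above works on abstracts grouped by their predominant topic. A minimal sketch of that grouping step follows, assuming each abstract comes with a vector of per-topic weights (as both LDA and BERTopic can provide). The function names and the toy numbers are hypothetical and are not taken from the study.

```python
from collections import defaultdict


def predominant_topic(topic_weights):
    """Return the index of the highest-weight topic (first one wins on ties)."""
    return max(range(len(topic_weights)), key=topic_weights.__getitem__)


def group_by_predominant_topic(doc_topic_matrix):
    """Map topic index -> list of document indices whose top topic it is."""
    groups = defaultdict(list)
    for doc_id, weights in enumerate(doc_topic_matrix):
        groups[predominant_topic(weights)].append(doc_id)
    return dict(groups)


def review_accuracy(groups, expert_judgements):
    """Per-topic fraction of reviewed documents judged to fit their topic.

    expert_judgements maps doc_id -> True/False (hypothetical review format).
    Topics with no reviewed documents are omitted from the result.
    """
    per_topic = {}
    for topic, doc_ids in groups.items():
        reviewed = [d for d in doc_ids if d in expert_judgements]
        if reviewed:
            hits = sum(1 for d in reviewed if expert_judgements[d])
            per_topic[topic] = hits / len(reviewed)
    return per_topic


# Toy document-topic weights for four abstracts and three topics (made up).
doc_topic = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.2, 0.6],
]
groups = group_by_predominant_topic(doc_topic)
judgements = {0: True, 1: True, 2: False, 3: True}
accuracy = review_accuracy(groups, judgements)
```

Computing the accuracy per topic rather than as one pooled number makes it easier to see whether a model's errors concentrate in a few poorly formed topics.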
In addition, BERTopic provides a built-in interface to ChatGPT-4-Turbo or other large language models for automatic topic interpretation, whereas with LDA the interpretation must be performed manually and special procedures are needed for data pre-processing and stop-word exclusion. Therefore, while LDA remains valuable for large-scale text analysis under resource constraints, AI-assisted BERTopic offers significant advantages in interpretability and semantic coherence for extracting valuable insights from textual data.
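The pre-processing and stop-word exclusion that LDA needs can be sketched as below. This is an illustration under stated assumptions, not the authors' pipeline: the stop-word list is a deliberately tiny hypothetical one, and the tokenizer and `min_len` threshold are choices made for the example.

```python
import re
from collections import Counter

# Hypothetical, deliberately small stop-word list; a real pipeline would use
# a much larger list plus domain-specific terms.
STOP_WORDS = {"the", "of", "in", "and", "a", "to", "with", "was", "were", "for", "is", "on"}

# Lowercase alphabetic tokens, keeping internal hyphens ("opioid-related").
TOKEN_RE = re.compile(r"[a-z]+(?:-[a-z]+)*")


def preprocess(text, stop_words=STOP_WORDS, min_len=3):
    """Lowercase, tokenize, and drop stop words and very short tokens."""
    tokens = TOKEN_RE.findall(text.lower())
    return [t for t in tokens if t not in stop_words and len(t) >= min_len]


def bag_of_words(tokens):
    """Count token occurrences; LDA operates on counts like these."""
    return Counter(tokens)


sentence = "Opioid use was associated with arrhythmia and hypotension in women."
tokens = preprocess(sentence)
counts = bag_of_words(tokens)
```

Embedding-based approaches such as BERTopic are typically run on the unfiltered text, which is one reason this extra step is specific to the LDA side of the comparison.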

Source journal: Experimental Biology and Medicine (Medicine: Research & Experimental)
CiteScore: 6.00
Self-citation rate: 0.00%
Articles published: 157
Review time: 1 month
Journal description: Experimental Biology and Medicine (EBM) is a global, peer-reviewed journal dedicated to the publication of multidisciplinary and interdisciplinary research in the biomedical sciences. EBM provides both research and review articles as well as meeting symposia and brief communications. Articles in EBM represent cutting-edge research at the overlapping junctions of the biological, physical and engineering sciences that impact upon the health and welfare of the world's population. Topics covered in EBM include: Anatomy/Pathology; Biochemistry and Molecular Biology; Bioimaging; Biomedical Engineering; Bionanoscience; Cell and Developmental Biology; Endocrinology and Nutrition; Environmental Health/Biomarkers/Precision Medicine; Genomics, Proteomics, and Bioinformatics; Immunology/Microbiology/Virology; Mechanisms of Aging; Neuroscience; Pharmacology and Toxicology; Physiology; Stem Cell Biology; Structural Biology; Systems Biology and Microphysiological Systems; and Translational Research.