Enhancing topic coherence and diversity in document embeddings using LLMs: A focus on BERTopic

Impact Factor: 7.5 · CAS Tier 1 (Computer Science) · JCR Q1 (Computer Science, Artificial Intelligence)
Chibok Yang, Yangsok Kim
{"title":"Enhancing topic coherence and diversity in document embeddings using LLMs: A focus on BERTopic","authors":"Chibok Yang,&nbsp;Yangsok Kim","doi":"10.1016/j.eswa.2025.127517","DOIUrl":null,"url":null,"abstract":"<div><div>With the rapid growth of digital textual data, the need for systematic organization of large datasets has become critical. Topic modeling stands out as an effective approach for analyzing large volumes of text datasets. Neural Topic Models (NTMs) have been developed to overcome the limitations of traditional methods by using contextual embeddings, such as Bidirectional Encoder Representations from Transformers (BERT), to improve topic coherence. Recent advancements in Natural Language Processing (NLP) have further enhanced document processing capabilities through large language models (LLMs) such as LLaMA and the Generative Pre-trained Transformer (GPT). This research explores whether LLM embeddings within NTMs offer better performance compared to conventional models like Sentence-BERT (S-BERT) and DistilBERT. In particular, we examine the impact of text preprocessing on topic modeling. A comparative analysis is conducted using datasets from three domains, evaluating six topic models, including LLMs such as Falcon and LLaMA3, using three evaluation metrics. Results show that while no single model consistently excelled across all metrics, LLaMA3 demonstrated the best performance in coherence among the LLMs. In addition, overall topic modeling performance improved with the application of all six preprocessing techniques. LLaMA3 showed progressively better performance with additional preprocessing, confirming its stability and effectiveness in topic modeling. These findings suggest that LLMs can be reliable tools for topic identification across diverse datasets.</div></div>","PeriodicalId":50461,"journal":{"name":"Expert Systems with Applications","volume":"281 ","pages":"Article 127517"},"PeriodicalIF":7.5000,"publicationDate":"2025-04-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Expert Systems with Applications","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S095741742501139X","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

With the rapid growth of digital textual data, the need for systematic organization of large datasets has become critical. Topic modeling stands out as an effective approach for analyzing large text corpora. Neural Topic Models (NTMs) were developed to overcome the limitations of traditional methods by using contextual embeddings, such as those from Bidirectional Encoder Representations from Transformers (BERT), to improve topic coherence. Recent advances in Natural Language Processing (NLP) have further enhanced document processing through large language models (LLMs) such as LLaMA and the Generative Pre-trained Transformer (GPT). This research explores whether LLM embeddings within NTMs offer better performance than conventional models such as Sentence-BERT (S-BERT) and DistilBERT. In particular, we examine the impact of text preprocessing on topic modeling. A comparative analysis is conducted on datasets from three domains, evaluating six topic models, including LLMs such as Falcon and LLaMA3, with three evaluation metrics. Results show that, while no single model consistently excelled across all metrics, LLaMA3 demonstrated the best coherence performance among the LLMs. In addition, overall topic modeling performance improved when all six preprocessing techniques were applied. LLaMA3 showed progressively better performance with additional preprocessing, confirming its stability and effectiveness in topic modeling. These findings suggest that LLMs can be reliable tools for topic identification across diverse datasets.
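The pipeline the abstract describes hinges on one plug-in point: which encoder produces the document embeddings that BERTopic clusters. Below is a minimal sketch of that plug-in point using the publicly documented BERTopic API; the S-BERT checkpoint and the 20 Newsgroups corpus are illustrative stand-ins only (the abstract does not name the three datasets), and the same `embeddings` argument would accept vectors from an LLM encoder such as LLaMA3 or Falcon.

```python
# Minimal sketch (not the authors' exact pipeline): precomputed document
# embeddings are handed to BERTopic, so the encoder can be swapped from an
# S-BERT checkpoint to an LLM-based embedder without changing anything else.
from sklearn.datasets import fetch_20newsgroups
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic

# Illustrative corpus only; the paper's three domain datasets are not named here.
docs = fetch_20newsgroups(subset="all",
                          remove=("headers", "footers", "quotes")).data[:2000]

# Encode documents outside BERTopic so the embedding backend is interchangeable.
encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative S-BERT model
embeddings = encoder.encode(docs, show_progress_bar=False)

# BERTopic then runs dimensionality reduction, clustering, and topic
# representation on top of whatever embeddings it receives.
topic_model = BERTopic()
topics, _ = topic_model.fit_transform(docs, embeddings)
print(topic_model.get_topic_info().head())
```

Keeping the encoding step outside the topic model is what makes the paper's comparison possible: differences in coherence and diversity can then be attributed to the embedding backend, and to the preprocessing applied before encoding, rather than to the clustering stage.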
Source Journal

Expert Systems with Applications (Engineering & Technology: Electrical & Electronic Engineering)
CiteScore: 13.80
Self-citation rate: 10.60%
Articles published per year: 2045
Average review time: 8.7 months
Journal description: Expert Systems With Applications is an international journal dedicated to the exchange of information on expert and intelligent systems used globally in industry, government, and universities. The journal emphasizes original papers covering the design, development, testing, implementation, and management of these systems, offering practical guidelines. It spans sectors such as finance, engineering, marketing, law, project management, information management, and medicine. The journal also welcomes papers on multi-agent systems, knowledge management, neural networks, knowledge discovery, data mining, and other related areas, excluding applications to military/defense systems.