{"title":"Enhancing topic coherence and diversity in document embeddings using LLMs: A focus on BERTopic","authors":"Chibok Yang, Yangsok Kim","doi":"10.1016/j.eswa.2025.127517","DOIUrl":null,"url":null,"abstract":"<div><div>With the rapid growth of digital textual data, the need for systematic organization of large datasets has become critical. Topic modeling stands out as an effective approach for analyzing large volumes of text datasets. Neural Topic Models (NTMs) have been developed to overcome the limitations of traditional methods by using contextual embeddings, such as Bidirectional Encoder Representations from Transformers (BERT), to improve topic coherence. Recent advancements in Natural Language Processing (NLP) have further enhanced document processing capabilities through large language models (LLMs) such as LLaMA and the Generative Pre-trained Transformer (GPT). This research explores whether LLM embeddings within NTMs offer better performance compared to conventional models like Sentence-BERT (S-BERT) and DistilBERT. In particular, we examine the impact of text preprocessing on topic modeling. A comparative analysis is conducted using datasets from three domains, evaluating six topic models, including LLMs such as Falcon and LLaMA3, using three evaluation metrics. Results show that while no single model consistently excelled across all metrics, LLaMA3 demonstrated the best performance in coherence among the LLMs. In addition, overall topic modeling performance improved with the application of all six preprocessing techniques. LLaMA3 showed progressively better performance with additional preprocessing, confirming its stability and effectiveness in topic modeling. These findings suggest that LLMs can be reliable tools for topic identification across diverse datasets.</div></div>","PeriodicalId":50461,"journal":{"name":"Expert Systems with Applications","volume":"281 ","pages":"Article 127517"},"PeriodicalIF":7.5000,"publicationDate":"2025-04-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Expert Systems with Applications","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S095741742501139X","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Abstract
With the rapid growth of digital textual data, the need for systematic organization of large datasets has become critical. Topic modeling stands out as an effective approach for analyzing large volumes of textual data. Neural Topic Models (NTMs) have been developed to overcome the limitations of traditional methods by using contextual embeddings, such as those from Bidirectional Encoder Representations from Transformers (BERT), to improve topic coherence. Recent advancements in Natural Language Processing (NLP) have further enhanced document processing capabilities through large language models (LLMs) such as LLaMA and the Generative Pre-trained Transformer (GPT). This research explores whether LLM embeddings within NTMs offer better performance than conventional models such as Sentence-BERT (S-BERT) and DistilBERT. In particular, we examine the impact of text preprocessing on topic modeling. A comparative analysis is conducted on datasets from three domains, evaluating six topic models, including variants built on LLM embeddings from Falcon and LLaMA3, against three evaluation metrics. Results show that while no single model consistently excelled across all metrics, LLaMA3 demonstrated the best coherence among the LLMs. In addition, overall topic modeling performance improved when all six preprocessing techniques were applied. LLaMA3 performed progressively better as preprocessing steps were added, confirming its stability and effectiveness in topic modeling. These findings suggest that LLMs can be reliable tools for topic identification across diverse datasets.
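The abstract does not spell out the implementation, but the comparison it describes maps naturally onto BERTopic's pluggable embedding interface. The sketch below is a minimal illustration assuming the public bertopic and sentence-transformers APIs; the 20 Newsgroups corpus, the S-BERT checkpoint, and the embed_with_llm stand-in are placeholders, not the paper's actual datasets or models.

```python
# Minimal sketch (not the authors' exact pipeline) of swapping
# embedding backends in BERTopic.
import numpy as np
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from sklearn.datasets import fetch_20newsgroups

# Illustrative corpus; the paper uses datasets from three domains
# that are not specified in this abstract.
docs = fetch_20newsgroups(
    subset="all", remove=("headers", "footers", "quotes")
).data[:1000]

# Baseline: an S-BERT encoder handed directly to BERTopic.
sbert = SentenceTransformer("all-MiniLM-L6-v2")  # assumed checkpoint
sbert_model = BERTopic(embedding_model=sbert)
sbert_topics, _ = sbert_model.fit_transform(docs)

# LLM route: BERTopic also accepts precomputed embeddings, so any model
# that maps documents to vectors (e.g., mean-pooled LLaMA3 or Falcon
# hidden states) can reuse the identical downstream clustering pipeline.
def embed_with_llm(texts):
    """Hypothetical stand-in for an LLM embedding function."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(texts), 4096))  # replace with real vectors

llm_model = BERTopic()
llm_topics, _ = llm_model.fit_transform(docs, embeddings=embed_with_llm(docs))

print(sbert_model.get_topic(0))  # top words of the first S-BERT topic
```

Because the clustering and topic-representation stages are identical in both runs, any difference in the resulting topics can be attributed to the embedding backend, which is the comparison the study performs.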
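Of the three evaluation metrics, only coherence and diversity are named in the title and abstract. A common way to score both, sketched below on a toy corpus, is gensim's c_v coherence plus the standard unique-top-words diversity ratio; whether these match the authors' exact metric definitions is an assumption.

```python
# Minimal sketch of the two metric families named in the abstract:
# topic coherence (gensim's c_v) and topic diversity (fraction of
# unique words among each topic's top-N words). The toy corpus and
# topic word lists are illustrative only.
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel

tokenized_docs = [
    ["bank", "rate", "inflation", "economy"],
    ["model", "transformer", "embedding", "training"],
    ["match", "goal", "league", "season"],
    ["economy", "bank", "market", "rate"],
    ["embedding", "topic", "model", "corpus"],
]

# Top words per topic, e.g., as returned by topic_model.get_topic(i).
topic_words = [
    ["bank", "rate", "economy", "inflation"],
    ["model", "embedding", "transformer", "topic"],
]

dictionary = Dictionary(tokenized_docs)
coherence = CoherenceModel(
    topics=topic_words,
    texts=tokenized_docs,
    dictionary=dictionary,
    coherence="c_v",
).get_coherence()

def topic_diversity(topics, topn=4):
    """Share of unique words across all topics' top-N words (1.0 = no overlap)."""
    words = [w for topic in topics for w in topic[:topn]]
    return len(set(words)) / len(words)

print(f"c_v coherence:   {coherence:.3f}")
print(f"topic diversity: {topic_diversity(topic_words):.3f}")
```

Coherence rewards topics whose top words co-occur in the corpus, while diversity penalizes word overlap between topics; reporting both guards against a model that scores well on one by sacrificing the other.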
About the journal
Expert Systems With Applications is an international journal dedicated to the exchange of information on expert and intelligent systems used globally in industry, government, and universities. The journal emphasizes original papers covering the design, development, testing, implementation, and management of these systems, offering practical guidelines. It spans various sectors such as finance, engineering, marketing, law, project management, information management, medicine, and more. The journal also welcomes papers on multi-agent systems, knowledge management, neural networks, knowledge discovery, data mining, and other related areas, excluding applications to military/defense systems.