An integrated clustering and BERT framework for improved topic modeling.

Lijimol George, P Sumathy
{"title":"An integrated clustering and BERT framework for improved topic modeling.","authors":"Lijimol George,&nbsp;P Sumathy","doi":"10.1007/s41870-023-01268-w","DOIUrl":null,"url":null,"abstract":"<p><p>Topic modelling is a machine learning technique that is extensively used in Natural Language Processing (NLP) applications to infer topics within unstructured textual data. Latent Dirichlet Allocation (LDA) is one of the most used topic modeling techniques that can automatically detect topics from a huge collection of text documents. However, the LDA-based topic models alone do not always provide promising results. Clustering is one of the effective unsupervised machine learning algorithms that are extensively used in applications including extracting information from unstructured textual data and topic modeling. A hybrid model of Bidirectional Encoder Representations from Transformers (BERT) and Latent Dirichlet Allocation (LDA) in topic modeling with clustering based on dimensionality reduction have been studied in detail. As the clustering algorithms are computationally complex, the complexity increases with the higher number of features, the PCA, t-SNE and UMAP based dimensionality reduction methods are also performed. Finally, a unified clustering-based framework using BERT and LDA is proposed as part of this study for mining a set of meaningful topics from the massive text corpora. The experiments are conducted to demonstrate the effectiveness of the cluster-informed topic modeling framework using BERT and LDA by simulating user input on benchmark datasets. The experimental results show that clustering with dimensionality reduction would help infer more coherent topics and hence this unified clustering and BERT-LDA based approach can be effectively utilized for building topic modeling applications.</p>","PeriodicalId":73455,"journal":{"name":"International journal of information technology : an official journal of Bharati Vidyapeeth's Institute of Computer Applications and Management","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10163298/pdf/","citationCount":"4","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"International journal of information technology : an official journal of Bharati Vidyapeeth's Institute of Computer Applications and Management","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1007/s41870-023-01268-w","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2023/5/6 0:00:00","PubModel":"Epub","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 4

Abstract

Topic modeling is a machine learning technique that is extensively used in Natural Language Processing (NLP) applications to infer topics within unstructured textual data. Latent Dirichlet Allocation (LDA) is one of the most widely used topic modeling techniques and can automatically detect topics from a huge collection of text documents. However, LDA-based topic models alone do not always provide promising results. Clustering is an effective unsupervised machine learning technique that is extensively used in applications such as extracting information from unstructured textual data and topic modeling. A hybrid model of Bidirectional Encoder Representations from Transformers (BERT) and Latent Dirichlet Allocation (LDA) for topic modeling, combined with clustering based on dimensionality reduction, has been studied in detail. Because clustering algorithms are computationally complex, and the complexity increases with the number of features, PCA-, t-SNE-, and UMAP-based dimensionality reduction methods are also applied. Finally, a unified clustering-based framework using BERT and LDA is proposed as part of this study for mining a set of meaningful topics from massive text corpora. Experiments are conducted to demonstrate the effectiveness of the cluster-informed topic modeling framework using BERT and LDA by simulating user input on benchmark datasets. The experimental results show that clustering with dimensionality reduction helps infer more coherent topics, and hence this unified clustering and BERT-LDA based approach can be effectively utilized for building topic modeling applications.
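
A minimal sketch, in Python, of the kind of pipeline the abstract describes: documents are encoded both with a pre-trained BERT sentence encoder and with LDA topic distributions, the two representations are concatenated, reduced in dimensionality, and then clustered so that each cluster can be read as a topic. The specific libraries (sentence-transformers, gensim, umap-learn, scikit-learn), the model name "all-MiniLM-L6-v2", and all parameter values are illustrative assumptions, not the authors' exact configuration.

```python
# Illustrative pipeline (assumed libraries and parameters, not the authors'
# exact setup): BERT embeddings + LDA topic vectors -> dimensionality
# reduction -> clustering, where each cluster is read as a topic.
import numpy as np
from sentence_transformers import SentenceTransformer
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from umap import UMAP
from sklearn.cluster import KMeans

# Toy corpus for illustration; in practice this would be a large collection.
docs = [
    "the stock market fell sharply after the quarterly earnings report",
    "investors worry about inflation and rising interest rates",
    "the team scored a late goal to win the championship final",
    "the injured striker will miss the rest of the football season",
    "new vaccine trials show a promising immune response in adults",
    "researchers report progress on a treatment for the rare disease",
]

# 1. Contextual document embeddings from a pre-trained BERT-style encoder.
bert_vecs = SentenceTransformer("all-MiniLM-L6-v2").encode(docs)

# 2. LDA topic distributions as an additional probabilistic representation.
tokens = [d.lower().split() for d in docs]
dictionary = Dictionary(tokens)
bows = [dictionary.doc2bow(t) for t in tokens]
lda = LdaModel(bows, num_topics=3, id2word=dictionary, passes=10, random_state=42)
lda_vecs = np.array(
    [[prob for _, prob in lda.get_document_topics(bow, minimum_probability=0.0)]
     for bow in bows]
)

# 3. Concatenate the two representations into one hybrid feature space.
hybrid = np.hstack([bert_vecs, lda_vecs])

# 4. Reduce dimensionality (UMAP here; PCA and t-SNE are the alternatives
#    mentioned in the abstract) so that clustering stays tractable.
reduced = UMAP(n_components=2, n_neighbors=3, random_state=42).fit_transform(hybrid)

# 5. Cluster the reduced vectors; each cluster is treated as a topic.
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(reduced)
print(labels)
```

PCA or t-SNE (e.g. sklearn.decomposition.PCA or sklearn.manifold.TSNE) could be swapped in for UMAP in step 4, and representative words for each cluster could then be extracted, for example from the highest-weighted terms of the documents assigned to it; both choices are left open here since the abstract does not fix them.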
