An Exploratory Analysis of GSDMM and BERTopic on Short Text Topic Modelling

2022 Fourth International Conference on Cognitive Computing and Information Processing (CCIP)
Pub Date: 2022-12-23
DOI: 10.1109/CCIP57447.2022.10058687

Abhinandan Udupa, K. N. Adarsh, Anvitha Aravinda, Neelam H Godihal, N. Kayarvizhy

Citations: 1

Abstract

Topic models can be a useful tool for uncovering latent topics in collections of documents. Short text clustering has become an increasingly important task as social networking sites such as Twitter have gained popularity. Short text is characterised by high sparsity, high dimensionality, and large volume, all of which make it difficult to model. Two of the best-known algorithms for short text modelling are BERTopic and the Gibbs Sampling Dirichlet Multinomial Mixture model (GSDMM). GSDMM is a topic model that can infer the number of topic clusters automatically, strikes a good balance between the completeness and homogeneity of the clustering results, and converges quickly. BERTopic is a neural topic model that extracts coherent topic representations by clustering texts according to the semantic similarity of their words and phrases and then applying a class-based form of TF-IDF. In this paper we compare the two algorithms to determine which is more effective for short text topic modelling.
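
As a rough illustration of the two mechanisms named in the abstract, the sketch below implements a small collapsed Gibbs sampler for a Dirichlet multinomial mixture, in which each document belongs to exactly one cluster, and the class-based TF-IDF weighting as it is commonly stated, W(t, c) = tf(t, c) * log(1 + A / tf(t)), where A is the mean number of tokens per class. It is not the authors' code: the function names, the parameters (K, alpha, beta, iters, seed), and the toy documents are assumptions chosen for illustration. BERTopic's embedding and clustering stages are omitted, so c_tf_idf simply takes cluster labels as input.

```python
import math
import random
from collections import Counter


def gsdmm(docs, K=8, alpha=0.1, beta=0.1, iters=15, seed=0):
    """Cluster tokenised short texts; returns one cluster label per document."""
    rng = random.Random(seed)
    V = len({w for doc in docs for w in doc})   # vocabulary size
    m = [0] * K                                 # documents per cluster
    n = [0] * K                                 # tokens per cluster
    nw = [Counter() for _ in range(K)]          # word counts per cluster

    def shift(doc, k, sign):
        # Add (sign=+1) or remove (sign=-1) a document from cluster k.
        m[k] += sign
        n[k] += sign * len(doc)
        for w in doc:
            nw[k][w] += sign

    def log_score(doc, k):
        # Unnormalised log-probability of assigning doc to cluster k,
        # computed with the document already removed from the counts.
        s = math.log(m[k] + alpha)
        for w, c in Counter(doc).items():
            s += sum(math.log(nw[k][w] + beta + j) for j in range(c))
        s -= sum(math.log(n[k] + V * beta + i) for i in range(len(doc)))
        return s

    labels = [rng.randrange(K) for _ in docs]   # random initial assignment
    for doc, k in zip(docs, labels):
        shift(doc, k, +1)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            shift(doc, labels[d], -1)           # take the document out
            logs = [log_score(doc, k) for k in range(K)]
            top = max(logs)                     # subtract max for stability
            weights = [math.exp(x - top) for x in logs]
            labels[d] = rng.choices(range(K), weights=weights)[0]
            shift(doc, labels[d], +1)           # place it in the sampled cluster
    return labels


def c_tf_idf(docs, labels):
    """Class-based TF-IDF: W[c][t] = tf[t, c] * log(1 + A / tf[t])."""
    per_class = {}
    for doc, k in zip(docs, labels):
        per_class.setdefault(k, Counter()).update(doc)
    total = Counter()                           # tf[t] over all classes
    for counts in per_class.values():
        total.update(counts)
    avg_words = sum(total.values()) / len(per_class)   # A: mean tokens per class
    return {
        k: {t: tf * math.log(1 + avg_words / total[t]) for t, tf in counts.items()}
        for k, counts in per_class.items()
    }


if __name__ == "__main__":
    # Hypothetical toy corpus of pre-tokenised short texts.
    toy_docs = [
        ["cat", "dog", "pet"], ["dog", "pet", "vet"], ["cat", "pet"],
        ["stock", "market", "bank"], ["bank", "stock"], ["market", "trade"],
    ]
    toy_labels = gsdmm(toy_docs, K=4)
    print("labels:", toy_labels)
    print("clusters in use:", len(set(toy_labels)))
    for k, weights in c_tf_idf(toy_docs, toy_labels).items():
        print(k, sorted(weights, key=weights.get, reverse=True)[:3])
```

Starting the sampler with K larger than the expected number of topics lets it leave some clusters empty, so the number of distinct labels returned serves as the estimated cluster count. The hyperparameter values shown are placeholders, not settings reported in the paper.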
