Comparing Different Term Weighting Schemas for Topic Modeling

2016 18th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC). Pub Date: 2016-09-01. DOI: 10.1109/SYNASC.2016.055

Ciprian-Octavian Truică, F. Rădulescu, A. Boicea

Citations: 17

Abstract

Topic modeling is a type of statistical modeling that tries to determine the topics present in a corpus of documents. The accuracy measures applied to clustering algorithms can also be used to assess the accuracy of topic modeling algorithms, because determining the topics of documents is similar to clustering them. This paper presents an experimental validation of the accuracy of Latent Dirichlet Allocation in comparison with Non-Negative Matrix Factorization and K-Means. The experiments use different weighting schemas when constructing the document-term matrix to determine whether the accuracy of the algorithms improves. Two well-known, already labeled text corpora are used for testing. Purity and the Adjusted Rand Index are used to evaluate accuracy. A comparison of the run-time of these algorithms is also presented.
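The two accuracy measures named in the abstract can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the authors' evaluation code. The function names `purity` and `adjusted_rand_index` and the toy label lists are assumptions made for the example. It assumes each document is assigned to a single cluster or dominant topic.

```python
from collections import Counter
from math import comb


def purity(labels_true, labels_pred):
    """Fraction of documents that belong to the majority true class of their cluster."""
    clusters = {}
    for true_label, cluster in zip(labels_true, labels_pred):
        clusters.setdefault(cluster, Counter())[true_label] += 1
    majority_total = sum(max(counts.values()) for counts in clusters.values())
    return majority_total / len(labels_true)


def adjusted_rand_index(labels_true, labels_pred):
    """Rand index corrected for chance, computed from the contingency table."""
    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two documents: nothing to disagree on
    # Pairs of documents that share both a true class and a cluster.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    # Pairs that share a true class, and pairs that share a cluster.
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / total_pairs
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single class and a single cluster
    return (sum_cells - expected) / (max_index - expected)


if __name__ == "__main__":
    # Hypothetical toy labels: six documents, two true classes, two clusters.
    y_true = [0, 0, 0, 1, 1, 1]
    y_pred = [1, 1, 0, 0, 0, 0]
    print("purity:", purity(y_true, y_pred))
    print("ARI:", adjusted_rand_index(y_true, y_pred))
```

Both measures ignore the cluster identifiers themselves, so a topic model's output can be compared with the corpus labels without first matching topic numbers to class names. Purity tends to rise as the number of clusters grows, which is one reason to report a chance-corrected measure such as the Adjusted Rand Index alongside it.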

