Comparing Different Term Weighting Schemas for Topic Modeling
Ciprian-Octavian Truică, F. Rădulescu, A. Boicea
2016 18th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC), 2016-09-01
DOI: 10.1109/SYNASC.2016.055 (https://doi.org/10.1109/SYNASC.2016.055)
Citations: 17
Abstract
A topic model is a type of statistical model that tries to determine the topics present in a corpus of documents. Because determining topics for documents is similar to clustering them, the accuracy measures applied to clustering algorithms can also be used to assess the accuracy of topic modeling algorithms. This paper presents an experimental validation of the accuracy of Latent Dirichlet Allocation in comparison with Non-Negative Matrix Factorization and K-Means. The experiments use different weighting schemas when constructing the document-term matrix to determine whether the accuracy of the algorithm improves. Two well-known, already labeled text corpora are used for testing. Purity and the Adjusted Rand Index are used to evaluate accuracy. A comparison of the run-time of these algorithms is also presented.
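The two evaluation measures named in the abstract can be illustrated with a minimal sketch. It uses the standard textbook definitions of Purity and the Adjusted Rand Index and is not the authors' implementation. The function names `purity` and `adjusted_rand_index` and the toy label lists are assumptions introduced here for illustration. In a topic modeling setting, `labels_pred` would hold the dominant topic assigned to each document and `labels_true` the corpus labels.

```python
from collections import Counter
from math import comb


def purity(labels_true, labels_pred):
    """Fraction of documents that belong to the majority class of their cluster."""
    clusters = {}
    for truth, cluster in zip(labels_true, labels_pred):
        clusters.setdefault(cluster, Counter())[truth] += 1
    majority_total = sum(max(counts.values()) for counts in clusters.values())
    return majority_total / len(labels_true)


def adjusted_rand_index(labels_true, labels_pred):
    """Rand index corrected for chance, computed from the contingency table."""
    n = len(labels_true)
    if n < 2:
        return 1.0  # no document pairs to compare
    # Count pairs of documents that share a cell, a true class, or a cluster.
    cell_counts = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in cell_counts.values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case: both partitions are trivial
    return (sum_cells - expected) / (max_index - expected)


# Toy data (hypothetical): six documents, two true classes, three clusters.
labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]
print(purity(labels_true, labels_pred))
print(adjusted_rand_index(labels_true, labels_pred))
```

Purity rewards clusters dominated by a single class but tends to rise as the number of clusters grows, whereas the Adjusted Rand Index corrects for chance agreement, which is one reason to report both.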