Approaches to improve preprocessing for Latent Dirichlet Allocation topic modeling
Jamie Zimmermann, Lance E. Champagne, John M. Dickens, Benjamin T. Hazen
Decision Support Systems, Volume 185, Article 114310 (published 2024-08-27)
DOI: 10.1016/j.dss.2024.114310
URL: https://www.sciencedirect.com/science/article/pii/S016792362400143X
Cited by: 0
Abstract
As part of natural language processing (NLP), the intent of topic modeling is to identify topics in textual corpora with limited human input. Current topic modeling techniques, like Latent Dirichlet Allocation (LDA), are limited in their pre-processing steps and currently require human judgement, increasing analysis time and opportunities for error. The purpose of this research is to address some of those limitations by introducing new approaches that improve coherence without adding computational complexity, and by providing an objective method for determining the number of topics within a corpus. First, we identify the need for a more robust stop-word list and introduce a new dimensionality-reduction heuristic that exploits the number of words within a document to infer the importance of word choice. Second, we develop an eigenvalue technique to determine the number of topics within a corpus. Third, we combine all of these techniques into the Zimm Approach, which produces higher-quality results than LDA in determining the number of topics within a corpus. The Zimm Approach, when tested against various subsets of the 20 Newsgroups dataset, produced the correct number of topics in 7 of 9 subsets, versus 0 of 9 when selecting by the highest coherence value produced by LDA.
Journal description:
The common thread of articles published in Decision Support Systems is their relevance to theoretical and technical issues in the support of enhanced decision making. The areas addressed may include foundations, functionality, interfaces, implementation, impacts, and evaluation of decision support systems (DSSs).