A data-driven analysis to determine the optimal number of topics 'K' for the latent Dirichlet allocation model

JCR: Q2, Mathematics
Astha Goyal, Indu Kashyap
Journal: Indonesian Journal of Electrical Engineering and Computer Science
Published: 2024-07-01
Publication type: Journal Article
DOI: 10.11591/ijeecs.v35.i1.pp310-322
URL: https://doi.org/10.11591/ijeecs.v35.i1.pp310-322
Citations: 0

Abstract

Topic modeling is an unsupervised machine learning technique that has been used successfully to classify and retrieve textual data. However, the performance of a topic model is sensitive to the choice of hyperparameters: the number of topics 'K' and the Dirichlet priors 'α' and 'β'. This data-driven analysis aims to determine the optimal number of topics 'K' for the latent Dirichlet allocation (LDA) model. The work uses three datasets, namely 20-Newsgroups news articles, Wikipedia articles, and Web of Science science articles, to assess and compare candidate values of 'K' through grid search. Grid search fits a model for every combination of candidate hyperparameter values and keeps the combination that performs best. The study seeks the 'K' that optimizes topic relevance, coherence, and model performance, using statistical metrics such as coherence scores, perplexity, and topic distribution quality. Through empirical analysis and evaluation on these datasets, the work offers guidance for choosing 'K' for LDA models.
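The grid search described in the abstract can be illustrated with a minimal, self-contained sketch. It is not the authors' implementation: the collapsed Gibbs sampler, the choice of UMass coherence as the scoring metric, the function names (`fit_lda`, `umass_coherence`, `grid_search`), the toy corpus, and the candidate grids are all assumptions made for illustration. The paper itself evaluates on 20-Newsgroups, Wikipedia, and Web of Science and also considers perplexity.

```python
import itertools
import math
import random
from collections import Counter


def fit_lda(docs, k, alpha, beta, iters=30, seed=0):
    """Fit LDA with a small collapsed Gibbs sampler; return per-topic word counts."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * k for _ in docs]         # tokens in doc d assigned to topic t
    topic_word = [Counter() for _ in range(k)]  # times word w is assigned to topic t
    topic_total = [0] * k                       # tokens assigned to topic t
    assign = [[rng.randrange(k) for _ in doc] for doc in docs]
    for d, doc in enumerate(docs):
        for w, z in zip(doc, assign[d]):
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assign[d][i]
                # Remove the token, then resample its topic from the conditional.
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(k)
                ]
                z = rng.choices(range(k), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return topic_word


def umass_coherence(topic_word, docs, top_n=5):
    """Mean UMass coherence over topics, from document co-occurrence of top words."""
    doc_sets = [set(doc) for doc in docs]
    scores = []
    for counts in topic_word:
        top = [w for w, c in counts.most_common(top_n) if c > 0]
        score = 0.0
        for m in range(1, len(top)):
            for l in range(m):
                d_l = sum(top[l] in s for s in doc_sets)
                d_ml = sum(top[m] in s and top[l] in s for s in doc_sets)
                score += math.log((d_ml + 1) / d_l)
        scores.append(score)
    return sum(scores) / len(scores)


def grid_search(docs, ks, alphas, betas):
    """Score every (K, alpha, beta) combination; return the best and all results."""
    results = []
    for k, alpha, beta in itertools.product(ks, alphas, betas):
        score = umass_coherence(fit_lda(docs, k, alpha, beta), docs)
        results.append(((k, alpha, beta), score))
    best = max(results, key=lambda r: r[1])
    return best, results


# Hypothetical toy corpus of pre-tokenized documents, for illustration only.
DOCS = [
    "cat dog pet vet animal".split(),
    "dog pet leash walk animal".split(),
    "cat pet food vet".split(),
    "stock market trade price".split(),
    "market price bank trade".split(),
    "bank stock invest price".split(),
    "galaxy star planet orbit".split(),
    "planet orbit star telescope".split(),
]

best, results = grid_search(DOCS, ks=[2, 3, 4], alphas=[0.1, 0.5], betas=[0.01])
print("best (K, alpha, beta):", best[0], "coherence:", round(best[1], 3))
```

On a real corpus the same loop would typically wrap an established LDA implementation and a held-out evaluation set; the exhaustive structure of the search stays the same, and its cost grows with the product of the grid sizes.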
Source journal
CiteScore: 2.90
Self-citation rate: 0.00%
Articles published: 782
Journal description: The Indonesian Journal of Electrical Engineering and Computer Science (formerly TELKOMNIKA Indonesian Journal of Electrical Engineering) aims to publish high-quality articles on all aspects of the latest developments in electrical engineering. Its scope encompasses applications of Telecommunication and Information Technology, Applied Computing and Computer, Instrumentation and Control, Electrical (Power), Electronics Engineering and Informatics, covering, but not limited to, the following areas: Signal Processing[...] Electronics[...] Electrical[...] Telecommunication[...] Instrumentation & Control[...] Computing and Informatics[...]