Topic specificity: A descriptive metric for algorithm selection and finding the right number of topics

Emil Rijcken, Kalliopi Zervanou, Pablo Mosteiro, Floortje Scheepers, Marco Spruit, Uzay Kaymak
Journal: Natural Language Processing Journal, Volume 8, Article 100082
DOI: 10.1016/j.nlp.2024.100082
Published: 2024-06-04
Open-access PDF: https://www.sciencedirect.com/science/article/pii/S294971912400030X
Citations: 0

Abstract

Topic modeling is a prevalent task for discovering the latent structure of a corpus, identifying a set of topics that represent the underlying themes of its documents. Despite its popularity, issues with its standard evaluation metric, the coherence score, lead to two common challenges: algorithm selection and determining the number of topics. To address these two issues, we propose the topic specificity metric, which captures the relative frequency of topic words in the corpus and serves as a proxy for the specificity of a word. In this work, we first formulate the metric. Second, we demonstrate that algorithms train topics at different specificity levels; this insight can be used for algorithm selection, as it allows users to distinguish and select algorithms with the desired specificity level. Lastly, we show a strictly positive monotonic correlation between topic specificity and the number of topics for LDA, FLSA-W, NMF, and LSI. This correlation can be used to select the number of topics, as it allows users to adjust the number of topics to their desired specificity level. Moreover, our descriptive metric provides a new perspective for characterizing topic models, allowing them to be better understood.
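The abstract does not give the metric's exact formulation; the following is a minimal sketch, assuming (as the abstract suggests) that a topic's specificity is summarized by the relative corpus frequency of its top words, with rarer words indicating a more specific topic. The function name `topic_specificity` and the toy corpus are illustrative, not taken from the paper.

```python
from collections import Counter


def topic_specificity(topic_words, corpus_tokens):
    """Sketch: mean relative corpus frequency of a topic's top words.

    A lower mean frequency means the topic is built from rarer,
    hence more specific, words.
    """
    counts = Counter(corpus_tokens)
    total = len(corpus_tokens)
    rel_freqs = [counts[w] / total for w in topic_words]
    return sum(rel_freqs) / len(rel_freqs)


corpus = "the cat sat on the mat the dog sat on the log".split()
# Rare content words yield a low value; frequent function words a high one.
print(topic_specificity(["cat", "dog"], corpus))
print(topic_specificity(["the", "on"], corpus))
```

Under this reading, comparing the value across algorithms (or across runs with different topic counts) would reveal the specificity levels and the monotonic trend the abstract describes.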
