Topic specificity: A descriptive metric for algorithm selection and finding the right number of topics

Emil Rijcken, Kalliopi Zervanou, Pablo Mosteiro, Floortje Scheepers, Marco Spruit, Uzay Kaymak
Journal: Natural Language Processing Journal, Volume 8, Article 100082
DOI: 10.1016/j.nlp.2024.100082
Published: 2024-06-04
Open-access PDF: https://www.sciencedirect.com/science/article/pii/S294971912400030X
Citations: 0

Abstract

Topic modeling is a prevalent task for discovering the latent structure of a corpus, identifying a set of topics that represent the underlying themes of its documents. Despite its popularity, issues with its standard evaluation metric, the coherence score, lead to two common challenges: algorithm selection and determining the number of topics. To address these two issues, we propose the topic specificity metric, which captures the relative frequency of topic words in the corpus and serves as a proxy for the specificity of a word. In this work, we first formulate the metric. Second, we demonstrate that algorithms train topics at different specificity levels; this insight can be used for algorithm selection, as it allows users to distinguish and select algorithms with the desired specificity level. Lastly, we show a strictly positive monotonic correlation between topic specificity and the number of topics for LDA, FLSA-W, NMF, and LSI. This correlation can be used to select the number of topics, as it allows users to adjust the number of topics to their desired specificity level. Moreover, our descriptive metric provides a new perspective for characterizing topic models, allowing them to be better understood.
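The abstract does not give the metric's exact formulation; the following is a minimal sketch, assuming (as the abstract suggests) that a topic's specificity is summarized by the relative corpus frequency of its top words, with rarer words indicating a more specific topic. The function name `topic_specificity` and the toy corpus are illustrative, not taken from the paper.

```python
from collections import Counter


def topic_specificity(topic_words, corpus_tokens):
    """Sketch: mean relative corpus frequency of a topic's top words.

    A lower mean frequency means the topic is built from rarer,
    hence more specific, words.
    """
    counts = Counter(corpus_tokens)
    total = len(corpus_tokens)
    rel_freqs = [counts[w] / total for w in topic_words]
    return sum(rel_freqs) / len(rel_freqs)


corpus = "the cat sat on the mat the dog sat on the log".split()
# Rare content words yield a low value; frequent function words a high one.
print(topic_specificity(["cat", "dog"], corpus))
print(topic_specificity(["the", "on"], corpus))
```

Under this reading, comparing the value across algorithms (or across runs with different topic counts) would reveal the specificity levels and the monotonic trend the abstract describes.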
