Topic Coherence Metrics: How Sensitive Are They?

João Marcos Campagnolo, Denio Duarte, Guillherme Dal Bianco
{"title":"主题一致性度量:它们有多敏感?","authors":"João Marcos Campagnolo, Denio Duarte, Guillherme Dal Bianco","doi":"10.5753/jidm.2022.2181","DOIUrl":null,"url":null,"abstract":"Topic modeling approaches extract the most relevant sets of words (grouped into so-called topics) from a document collection. The extracted topics can be used for analyzing the latent semantic structure hiding in the collection. This task is intrinsically unsupervised (without information about the labels), so evaluating the quality of the discovered topics is challenging. To address that, different unsupervised metrics have been proposed, and some of them are close to human perception, e.g., coherence metrics. Moreover, metrics behave differently when facing noise (i.e., unrelated words) in the topics. This article presents an exploratory analysis to evaluate how state-of-the-art metrics are affected by perturbations in the topics. By perturbation, we mean that intruder words are synthetically inserted into the topics to measure the metrics’ ability to deal with noises. Our findings highlight the importance of overlooked choices in the metrics sensitiveness context. We show that some topic modeling metrics are highly sensitive to disturbing; others can handle noisy topics with minimal perturbation. As a result, we rank the chosen metrics by sensitiveness, and as the contribution, we believe that the results might be helpful for developers to evaluate the discovered topics better.","PeriodicalId":301338,"journal":{"name":"J. Inf. Data Manag.","volume":"13 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"Topic Coherence Metrics: How Sensitive Are They?\",\"authors\":\"João Marcos Campagnolo, Denio Duarte, Guillherme Dal Bianco\",\"doi\":\"10.5753/jidm.2022.2181\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Topic modeling approaches extract the most relevant sets of words (grouped into so-called topics) from a document collection. The extracted topics can be used for analyzing the latent semantic structure hiding in the collection. This task is intrinsically unsupervised (without information about the labels), so evaluating the quality of the discovered topics is challenging. To address that, different unsupervised metrics have been proposed, and some of them are close to human perception, e.g., coherence metrics. Moreover, metrics behave differently when facing noise (i.e., unrelated words) in the topics. This article presents an exploratory analysis to evaluate how state-of-the-art metrics are affected by perturbations in the topics. By perturbation, we mean that intruder words are synthetically inserted into the topics to measure the metrics’ ability to deal with noises. Our findings highlight the importance of overlooked choices in the metrics sensitiveness context. We show that some topic modeling metrics are highly sensitive to disturbing; others can handle noisy topics with minimal perturbation. As a result, we rank the chosen metrics by sensitiveness, and as the contribution, we believe that the results might be helpful for developers to evaluate the discovered topics better.\",\"PeriodicalId\":301338,\"journal\":{\"name\":\"J. Inf. 
Data Manag.\",\"volume\":\"13 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-10-03\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"J. Inf. Data Manag.\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.5753/jidm.2022.2181\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"J. Inf. Data Manag.","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.5753/jidm.2022.2181","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 1

Abstract

Topic modeling approaches extract the most relevant sets of words (grouped into so-called topics) from a document collection. The extracted topics can be used to analyze the latent semantic structure hidden in the collection. This task is intrinsically unsupervised (no label information is available), so evaluating the quality of the discovered topics is challenging. To address this, different unsupervised metrics have been proposed, some of which correlate well with human perception, e.g., coherence metrics. Moreover, metrics behave differently when facing noise (i.e., unrelated words) in the topics. This article presents an exploratory analysis of how state-of-the-art metrics are affected by perturbations in the topics. By perturbation, we mean synthetically inserting intruder words into the topics to measure each metric's ability to handle noise. Our findings highlight the importance of often-overlooked design choices for metric sensitivity. We show that some topic modeling metrics are highly sensitive to such perturbations, while others change only minimally even on noisy topics. As a result, we rank the chosen metrics by sensitivity, and we believe these results can help practitioners better evaluate discovered topics.
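The perturbation protocol described in the abstract can be illustrated with a small, self-contained sketch. The snippet below is not the authors' code: it implements one common coherence measure (average pairwise NPMI over boolean document co-occurrence, in the spirit of the C_NPMI metric) and scores the same topic before and after an intruder word is inserted. The toy corpus, the topic, and the intruder word "market" are all illustrative assumptions.

```python
import math
from itertools import combinations

# Toy corpus (assumption): each document is a set of words, and
# coherence is computed from boolean document co-occurrence.
corpus = [
    {"dog", "cat", "pet", "vet"},
    {"dog", "pet", "leash", "walk"},
    {"cat", "pet", "litter"},
    {"stock", "market", "trade", "price"},
    {"market", "price", "economy"},
]

def doc_prob(words):
    """Fraction of documents containing every word in `words`."""
    hits = sum(1 for doc in corpus if all(w in doc for w in words))
    return hits / len(corpus)

def npmi_coherence(topic, eps=1e-12):
    """Average pairwise NPMI over all word pairs in the topic.

    NPMI(i, j) = PMI(i, j) / (-log p(i, j)); eps guards against
    log(0) when a pair never co-occurs, pushing NPMI toward -1.
    """
    scores = []
    for wi, wj in combinations(topic, 2):
        p_i, p_j, p_ij = doc_prob([wi]), doc_prob([wj]), doc_prob([wi, wj])
        pmi = math.log((p_ij + eps) / (p_i * p_j + eps))
        scores.append(pmi / (-math.log(p_ij + eps)))
    return sum(scores) / len(scores)

topic = ["dog", "cat", "pet"]        # a coherent topic
intruded = ["dog", "cat", "market"]  # "market" is the intruder word

print(f"clean topic:     {npmi_coherence(topic):.3f}")
print(f"perturbed topic: {npmi_coherence(intruded):.3f}")
```

On this toy corpus the intruded topic scores markedly lower than the clean one, because the intruder never co-occurs with the remaining topic words. The article's analysis generalizes this idea: it applies controlled intrusions across several coherence metrics and compares how strongly each one reacts.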