“Are the current topic modeling evaluation metrics enough?” Mitigating the limitations of topic modeling evaluation metrics using a multi-perspective game theoretic approach

IF 7.2 · CAS Q1 (Computer Science) · JCR Q1, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Antônio Pereira, Felipe Viegas, Diego Roberto Colombo Dias, Elisa Tuler, Ana Cláudia Machado, Guilherme Fonseca, Marcos André Gonçalves, Leonardo Rocha
{"title":"“当前的主题建模评估指标是否足够?”利用多视角博弈论方法减轻主题建模评价指标的局限性","authors":"Antônio Pereira ,&nbsp;Felipe Viegas ,&nbsp;Diego Roberto Colombo Dias ,&nbsp;Elisa Tuler ,&nbsp;Ana Cláudia Machado ,&nbsp;Guilherme Fonseca ,&nbsp;Marcos André Gonçalves ,&nbsp;Leonardo Rocha","doi":"10.1016/j.knosys.2025.113634","DOIUrl":null,"url":null,"abstract":"<div><div>Topic Modeling (TM) helps extract and organize information from large amounts of textual data by discovering semantic topics from documents. In this article, we delve into issues of <em>topic quality evaluation</em>, responsible for driving the advances in the TM field by assessing the overall quality of the topic generation process. Traditional TM metrics capture the quality of topics by strictly evaluating the words that make up the topics, either syntactically (e.g., NPMI, TF-IDF Coherence) or semantically (e.g., WEP). Here, we investigate whether we are approaching the limits of what the current evaluation metrics can assess regarding TM quality. For this, we perform a comprehensive experimental evaluation, considering three widely used datasets (ACM, 20News, WOS and Books) for which a natural organization of the collection’s documents into semantic classes (topics) does exist. We contrast the quality of topics generated by four traditional and state-of-the-art TM techniques (i.e., LDA, NMF, CluWords, BERTopic and TopicGPT) with each collection’s “natural topic structure”. Our results show that, despite the importance of the current metrics, they could not capture some important idiosyncratic aspects of the TM task, in the case, the capability of the topics to induce a structural organization of the document space into distinct semantic groups, indicating the need for new metrics that consider such aspects. In this sense, we propose incorporating metrics commonly used to evaluate clustering algorithms into the TM evaluation process, relying on some commonalities between TM and clustering tasks. Results highlight the effectiveness of clustering metrics in distinguishing the results of TM techniques when compared to the datasets’<em>ground truth</em> (class organization). However, adopting additional evaluation metrics implies expanding the analysis space. Thus, as a third contribution, we propose consolidating the various metrics into a unified framework, using Game Theory for decision-making, specifically Multi-Attribute Utility Theory (MAUT), which evaluates options based on weighted preferences across multiple criteria, on which the closer to 1, the greater the agreement between the criteria. Our experimental results demonstrate that MAUT allows a more precise assessment of TM quality. The CluWords achieved the best MAUT values for 20News, ACM and WOS collections (i.e., 0.9913, 0.9571 and 0.8684, respectively). While there is a high level of agreement between the metrics in the ACM collection, indicating CluWords as the best solution, there is a low divergence between the metrics in the WOS collection. 
In this case, evaluating each metric individually would lead to different conclusions, but MAUT shows us that CluWords is the most consistent as a whole, highlighting the benefits of exploring word embeddings for text representations and matrix factorization strategies to induce topics.</div></div>","PeriodicalId":49939,"journal":{"name":"Knowledge-Based Systems","volume":"320 ","pages":"Article 113634"},"PeriodicalIF":7.2000,"publicationDate":"2025-05-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"“Are the current topic modeling evaluation metrics enough?” Mitigating the limitations of topic modeling evaluation metrics using a multi-perspective game theoretic approach\",\"authors\":\"Antônio Pereira ,&nbsp;Felipe Viegas ,&nbsp;Diego Roberto Colombo Dias ,&nbsp;Elisa Tuler ,&nbsp;Ana Cláudia Machado ,&nbsp;Guilherme Fonseca ,&nbsp;Marcos André Gonçalves ,&nbsp;Leonardo Rocha\",\"doi\":\"10.1016/j.knosys.2025.113634\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Topic Modeling (TM) helps extract and organize information from large amounts of textual data by discovering semantic topics from documents. In this article, we delve into issues of <em>topic quality evaluation</em>, responsible for driving the advances in the TM field by assessing the overall quality of the topic generation process. Traditional TM metrics capture the quality of topics by strictly evaluating the words that make up the topics, either syntactically (e.g., NPMI, TF-IDF Coherence) or semantically (e.g., WEP). Here, we investigate whether we are approaching the limits of what the current evaluation metrics can assess regarding TM quality. For this, we perform a comprehensive experimental evaluation, considering three widely used datasets (ACM, 20News, WOS and Books) for which a natural organization of the collection’s documents into semantic classes (topics) does exist. We contrast the quality of topics generated by four traditional and state-of-the-art TM techniques (i.e., LDA, NMF, CluWords, BERTopic and TopicGPT) with each collection’s “natural topic structure”. Our results show that, despite the importance of the current metrics, they could not capture some important idiosyncratic aspects of the TM task, in the case, the capability of the topics to induce a structural organization of the document space into distinct semantic groups, indicating the need for new metrics that consider such aspects. In this sense, we propose incorporating metrics commonly used to evaluate clustering algorithms into the TM evaluation process, relying on some commonalities between TM and clustering tasks. Results highlight the effectiveness of clustering metrics in distinguishing the results of TM techniques when compared to the datasets’<em>ground truth</em> (class organization). However, adopting additional evaluation metrics implies expanding the analysis space. Thus, as a third contribution, we propose consolidating the various metrics into a unified framework, using Game Theory for decision-making, specifically Multi-Attribute Utility Theory (MAUT), which evaluates options based on weighted preferences across multiple criteria, on which the closer to 1, the greater the agreement between the criteria. Our experimental results demonstrate that MAUT allows a more precise assessment of TM quality. The CluWords achieved the best MAUT values for 20News, ACM and WOS collections (i.e., 0.9913, 0.9571 and 0.8684, respectively). 
While there is a high level of agreement between the metrics in the ACM collection, indicating CluWords as the best solution, there is a low divergence between the metrics in the WOS collection. In this case, evaluating each metric individually would lead to different conclusions, but MAUT shows us that CluWords is the most consistent as a whole, highlighting the benefits of exploring word embeddings for text representations and matrix factorization strategies to induce topics.</div></div>\",\"PeriodicalId\":49939,\"journal\":{\"name\":\"Knowledge-Based Systems\",\"volume\":\"320 \",\"pages\":\"Article 113634\"},\"PeriodicalIF\":7.2000,\"publicationDate\":\"2025-05-08\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Knowledge-Based Systems\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S095070512500680X\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Knowledge-Based Systems","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S095070512500680X","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

Topic Modeling (TM) helps extract and organize information from large amounts of textual data by discovering semantic topics in documents. In this article, we delve into topic quality evaluation, which drives advances in the TM field by assessing the overall quality of the topic generation process. Traditional TM metrics capture the quality of topics by strictly evaluating the words that make up the topics, either syntactically (e.g., NPMI, TF-IDF Coherence) or semantically (e.g., WEP). Here, we investigate whether we are approaching the limits of what the current evaluation metrics can assess regarding TM quality. For this, we perform a comprehensive experimental evaluation on four widely used datasets (ACM, 20News, WOS, and Books) for which a natural organization of each collection's documents into semantic classes (topics) does exist. We contrast the quality of topics generated by five traditional and state-of-the-art TM techniques (LDA, NMF, CluWords, BERTopic, and TopicGPT) with each collection's "natural topic structure". Our results show that, despite their importance, the current metrics cannot capture some important idiosyncratic aspects of the TM task, in this case the capability of the topics to induce a structural organization of the document space into distinct semantic groups, indicating the need for new metrics that consider such aspects. Accordingly, we propose incorporating metrics commonly used to evaluate clustering algorithms into the TM evaluation process, relying on commonalities between TM and clustering tasks. Results highlight the effectiveness of clustering metrics in distinguishing the results of TM techniques when compared against the datasets' ground truth (class organization). However, adopting additional evaluation metrics expands the analysis space. Thus, as a third contribution, we propose consolidating the various metrics into a unified framework using Game Theory for decision-making, specifically Multi-Attribute Utility Theory (MAUT), which evaluates options based on weighted preferences across multiple criteria; the closer the consolidated value is to 1, the greater the agreement among the criteria. Our experimental results demonstrate that MAUT allows a more precise assessment of TM quality. CluWords achieved the best MAUT values for the 20News, ACM, and WOS collections (0.9913, 0.9571, and 0.8684, respectively). While there is a high level of agreement among the metrics in the ACM collection, indicating CluWords as the best solution, the metrics in the WOS collection agree far less. In that case, evaluating each metric individually would lead to different conclusions, but MAUT shows that CluWords is the most consistent overall, highlighting the benefits of exploring word embeddings for text representation and matrix factorization strategies to induce topics.
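To make the word-level coherence metrics mentioned above concrete, the sketch below computes NPMI topic coherence in one common form: the average normalized pointwise mutual information over all pairs of a topic's top words, with probabilities estimated from boolean document co-occurrence. This is an illustrative implementation, not the paper's exact estimator; windowing and smoothing choices vary across the literature.

```python
# Minimal NPMI topic-coherence sketch (illustrative; not the paper's
# exact estimator). Probabilities come from boolean document
# co-occurrence over a reference corpus.
import math
from itertools import combinations

def npmi_coherence(topic_words, documents, eps=1e-12):
    """Average NPMI over all pairs of a topic's top words.
    documents: list of token sets; P(w) = fraction of documents containing w."""
    n_docs = len(documents)

    def p(*words):
        return sum(1 for doc in documents if all(w in doc for w in words)) / n_docs

    scores = []
    for wi, wj in combinations(topic_words, 2):
        p_ij = p(wi, wj)
        if p_ij == 0:  # unseen pair -> minimal NPMI by convention
            scores.append(-1.0)
            continue
        pmi = math.log(p_ij / (p(wi) * p(wj) + eps))
        scores.append(pmi / max(-math.log(p_ij), eps))  # normalize to [-1, 1]
    return sum(scores) / len(scores)

# Toy usage: a coherent word pair scores near the +1 ceiling.
docs = [set(d.split()) for d in [
    "neural network training loss",
    "neural network deep learning",
    "stock market price index",
]]
print(npmi_coherence(["neural", "network"], docs))  # ~1.0
```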
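The clustering-based evaluation the abstract proposes can be approximated as follows: assign each document to its dominant topic and compare the induced partition against the ground-truth classes. The specific metrics below (NMI and ARI from scikit-learn) are assumptions for illustration; the paper may rely on a different set of clustering metrics.

```python
# Hedged sketch of clustering-style TM evaluation: compare the partition
# induced by dominant topics against ground-truth classes. Metric choice
# (NMI, ARI) is an assumption, not necessarily the paper's exact set.
import numpy as np
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

def clustering_scores(doc_topic, true_labels):
    """doc_topic: (n_docs, n_topics) weights from any TM technique
    (LDA, NMF, ...); true_labels: ground-truth class ids per document."""
    pred = np.argmax(doc_topic, axis=1)  # dominant topic per document
    return {
        "NMI": normalized_mutual_info_score(true_labels, pred),
        "ARI": adjusted_rand_score(true_labels, pred),
    }

# Toy usage: 6 documents, 2 true classes, a clean topic partition.
theta = np.array([[.9, .1], [.8, .2], [.7, .3],
                  [.2, .8], [.1, .9], [.3, .7]])
print(clustering_scores(theta, [0, 0, 0, 1, 1, 1]))  # NMI = ARI = 1.0
```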
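Likewise, the MAUT consolidation can be sketched as a weighted additive utility over per-metric scores rescaled to [0, 1], so that values closer to 1 indicate greater agreement among criteria. The min-max normalization and equal-weight default below are assumptions for illustration; the authors' exact utility functions and weighting scheme may differ.

```python
# Minimal additive-MAUT sketch, assuming min-max utilities and
# user-chosen weights. Illustrative only; the paper's exact utility
# functions and weights may differ.
import numpy as np

def maut(scores, weights=None):
    """scores: (n_options, n_criteria) raw metric values, higher = better.
    Returns one utility in [0, 1] per option (e.g., per TM technique)."""
    s = np.asarray(scores, dtype=float)
    lo, hi = s.min(axis=0), s.max(axis=0)
    u = (s - lo) / np.where(hi > lo, hi - lo, 1.0)  # per-criterion utility
    if weights is None:
        w = np.full(s.shape[1], 1.0 / s.shape[1])   # equal weights by default
    else:
        w = np.asarray(weights, dtype=float) / np.sum(weights)
    return u @ w                                     # weighted additive utility

# Toy usage: 3 techniques x 3 metrics (e.g., NPMI, WEP, NMI).
scores = [[0.12, 0.45, 0.61],
          [0.10, 0.50, 0.58],
          [0.15, 0.48, 0.70]]
print(maut(scores))  # third technique leads on two criteria -> highest utility
```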
Source journal
Knowledge-Based Systems (Engineering & Technology - Computer Science: Artificial Intelligence)
CiteScore: 14.80
Self-citation rate: 12.50%
Publication volume: 1245
Review time: 7.8 months
Aims and scope: Knowledge-Based Systems, an international and interdisciplinary journal in artificial intelligence, publishes original, innovative, and creative research results in the field. It focuses on knowledge-based and other artificial intelligence techniques-based systems. The journal aims to support human prediction and decision-making through data science and computation techniques, provide a balanced coverage of theory and practical study, and encourage the development and implementation of knowledge-based intelligence models, methods, systems, and software tools. Applications in business, government, education, engineering, and healthcare are emphasized.