“Are the current topic modeling evaluation metrics enough?” Mitigating the limitations of topic modeling evaluation metrics using a multi-perspective game theoretic approach

Antônio Pereira, Felipe Viegas, Diego Roberto Colombo Dias, Elisa Tuler, Ana Cláudia Machado, Guilherme Fonseca, Marcos André Gonçalves, Leonardo Rocha
{"title":"“当前的主题建模评估指标是否足够?”利用多视角博弈论方法减轻主题建模评价指标的局限性","authors":"Antônio Pereira , Felipe Viegas , Diego Roberto Colombo Dias , Elisa Tuler , Ana Cláudia Machado , Guilherme Fonseca , Marcos André Gonçalves , Leonardo Rocha","doi":"10.1016/j.knosys.2025.113634","DOIUrl":null,"url":null,"abstract":"<div><div>Topic Modeling (TM) helps extract and organize information from large amounts of textual data by discovering semantic topics from documents. In this article, we delve into issues of <em>topic quality evaluation</em>, responsible for driving the advances in the TM field by assessing the overall quality of the topic generation process. Traditional TM metrics capture the quality of topics by strictly evaluating the words that make up the topics, either syntactically (e.g., NPMI, TF-IDF Coherence) or semantically (e.g., WEP). Here, we investigate whether we are approaching the limits of what the current evaluation metrics can assess regarding TM quality. For this, we perform a comprehensive experimental evaluation, considering three widely used datasets (ACM, 20News, WOS and Books) for which a natural organization of the collection’s documents into semantic classes (topics) does exist. We contrast the quality of topics generated by four traditional and state-of-the-art TM techniques (i.e., LDA, NMF, CluWords, BERTopic and TopicGPT) with each collection’s “natural topic structure”. Our results show that, despite the importance of the current metrics, they could not capture some important idiosyncratic aspects of the TM task, in the case, the capability of the topics to induce a structural organization of the document space into distinct semantic groups, indicating the need for new metrics that consider such aspects. In this sense, we propose incorporating metrics commonly used to evaluate clustering algorithms into the TM evaluation process, relying on some commonalities between TM and clustering tasks. Results highlight the effectiveness of clustering metrics in distinguishing the results of TM techniques when compared to the datasets’<em>ground truth</em> (class organization). However, adopting additional evaluation metrics implies expanding the analysis space. Thus, as a third contribution, we propose consolidating the various metrics into a unified framework, using Game Theory for decision-making, specifically Multi-Attribute Utility Theory (MAUT), which evaluates options based on weighted preferences across multiple criteria, on which the closer to 1, the greater the agreement between the criteria. Our experimental results demonstrate that MAUT allows a more precise assessment of TM quality. The CluWords achieved the best MAUT values for 20News, ACM and WOS collections (i.e., 0.9913, 0.9571 and 0.8684, respectively). While there is a high level of agreement between the metrics in the ACM collection, indicating CluWords as the best solution, there is a low divergence between the metrics in the WOS collection. 
In this case, evaluating each metric individually would lead to different conclusions, but MAUT shows us that CluWords is the most consistent as a whole, highlighting the benefits of exploring word embeddings for text representations and matrix factorization strategies to induce topics.</div></div>","PeriodicalId":49939,"journal":{"name":"Knowledge-Based Systems","volume":"320 ","pages":"Article 113634"},"PeriodicalIF":7.2000,"publicationDate":"2025-05-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"“Are the current topic modeling evaluation metrics enough?” Mitigating the limitations of topic modeling evaluation metrics using a multi-perspective game theoretic approach\",\"authors\":\"Antônio Pereira , Felipe Viegas , Diego Roberto Colombo Dias , Elisa Tuler , Ana Cláudia Machado , Guilherme Fonseca , Marcos André Gonçalves , Leonardo Rocha\",\"doi\":\"10.1016/j.knosys.2025.113634\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Topic Modeling (TM) helps extract and organize information from large amounts of textual data by discovering semantic topics from documents. In this article, we delve into issues of <em>topic quality evaluation</em>, responsible for driving the advances in the TM field by assessing the overall quality of the topic generation process. Traditional TM metrics capture the quality of topics by strictly evaluating the words that make up the topics, either syntactically (e.g., NPMI, TF-IDF Coherence) or semantically (e.g., WEP). Here, we investigate whether we are approaching the limits of what the current evaluation metrics can assess regarding TM quality. For this, we perform a comprehensive experimental evaluation, considering three widely used datasets (ACM, 20News, WOS and Books) for which a natural organization of the collection’s documents into semantic classes (topics) does exist. We contrast the quality of topics generated by four traditional and state-of-the-art TM techniques (i.e., LDA, NMF, CluWords, BERTopic and TopicGPT) with each collection’s “natural topic structure”. Our results show that, despite the importance of the current metrics, they could not capture some important idiosyncratic aspects of the TM task, in the case, the capability of the topics to induce a structural organization of the document space into distinct semantic groups, indicating the need for new metrics that consider such aspects. In this sense, we propose incorporating metrics commonly used to evaluate clustering algorithms into the TM evaluation process, relying on some commonalities between TM and clustering tasks. Results highlight the effectiveness of clustering metrics in distinguishing the results of TM techniques when compared to the datasets’<em>ground truth</em> (class organization). However, adopting additional evaluation metrics implies expanding the analysis space. Thus, as a third contribution, we propose consolidating the various metrics into a unified framework, using Game Theory for decision-making, specifically Multi-Attribute Utility Theory (MAUT), which evaluates options based on weighted preferences across multiple criteria, on which the closer to 1, the greater the agreement between the criteria. Our experimental results demonstrate that MAUT allows a more precise assessment of TM quality. The CluWords achieved the best MAUT values for 20News, ACM and WOS collections (i.e., 0.9913, 0.9571 and 0.8684, respectively). 
While there is a high level of agreement between the metrics in the ACM collection, indicating CluWords as the best solution, there is a low divergence between the metrics in the WOS collection. In this case, evaluating each metric individually would lead to different conclusions, but MAUT shows us that CluWords is the most consistent as a whole, highlighting the benefits of exploring word embeddings for text representations and matrix factorization strategies to induce topics.</div></div>\",\"PeriodicalId\":49939,\"journal\":{\"name\":\"Knowledge-Based Systems\",\"volume\":\"320 \",\"pages\":\"Article 113634\"},\"PeriodicalIF\":7.2000,\"publicationDate\":\"2025-05-08\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Knowledge-Based Systems\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S095070512500680X\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Knowledge-Based Systems","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S095070512500680X","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Topic Modeling (TM) helps extract and organize information from large amounts of textual data by discovering semantic topics in documents. In this article, we delve into the issue of topic quality evaluation, which drives advances in the TM field by assessing the overall quality of the topic generation process. Traditional TM metrics capture the quality of topics by strictly evaluating the words that make up the topics, either syntactically (e.g., NPMI, TF-IDF Coherence) or semantically (e.g., WEP). Here, we investigate whether we are approaching the limits of what the current evaluation metrics can assess regarding TM quality. For this, we perform a comprehensive experimental evaluation, considering four widely used datasets (ACM, 20News, WOS, and Books) for which a natural organization of the collection’s documents into semantic classes (topics) exists. We contrast the quality of topics generated by five traditional and state-of-the-art TM techniques (i.e., LDA, NMF, CluWords, BERTopic, and TopicGPT) with each collection’s “natural topic structure”. Our results show that, despite their importance, the current metrics fail to capture an important idiosyncratic aspect of the TM task, namely the capability of the topics to induce a structural organization of the document space into distinct semantic groups, indicating the need for new metrics that consider such aspects. Accordingly, we propose incorporating metrics commonly used to evaluate clustering algorithms into the TM evaluation process, relying on commonalities between TM and clustering tasks. Results highlight the effectiveness of clustering metrics in distinguishing the results of TM techniques when compared against each dataset’s ground truth (class organization). However, adopting additional evaluation metrics expands the analysis space. Thus, as a third contribution, we propose consolidating the various metrics into a unified framework, using Game Theory for decision-making, specifically Multi-Attribute Utility Theory (MAUT), which evaluates options based on weighted preferences across multiple criteria; the closer the resulting value is to 1, the greater the agreement among the criteria. Our experimental results demonstrate that MAUT allows a more precise assessment of TM quality. CluWords achieved the best MAUT values for the 20News, ACM, and WOS collections (0.9913, 0.9571, and 0.8684, respectively). While there is a high level of agreement among the metrics in the ACM collection, indicating CluWords as the best solution, the metrics show markedly lower agreement in the WOS collection. In that case, evaluating each metric individually would lead to different conclusions, but MAUT shows that CluWords is the most consistent overall, highlighting the benefits of exploring word embeddings for text representation and matrix factorization strategies to induce topics.
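
For illustration, a minimal sketch of the NPMI coherence cited in the abstract as a syntactic metric follows. The function name npmi_coherence, the per-document boolean co-occurrence estimate of word probabilities, and the eps smoothing are assumptions made for this sketch, not the paper's exact formulation.

    # Hedged sketch of NPMI topic coherence. Word probabilities are estimated
    # as the fraction of documents containing the word(s); `topic_words` is a
    # topic's top-N words and `documents` is a tokenized corpus (assumptions).
    import math
    from itertools import combinations

    def npmi_coherence(topic_words, documents, eps=1e-12):
        """Average NPMI over all pairs of a topic's top words."""
        n_docs = len(documents)
        doc_sets = [set(doc) for doc in documents]  # token set per document

        def p(*words):  # fraction of documents containing all given words
            return sum(all(w in d for w in words) for d in doc_sets) / n_docs

        scores = []
        for wi, wj in combinations(topic_words, 2):
            p_ij = p(wi, wj)
            if p_ij == 0:
                scores.append(-1.0)  # words never co-occur: minimal NPMI
                continue
            pmi = math.log(p_ij / (p(wi) * p(wj) + eps))
            scores.append(pmi / (-math.log(p_ij + eps)))  # normalize to [-1, 1]
        return sum(scores) / len(scores)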
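The proposal to borrow clustering metrics can be pictured as follows. This sketch assumes a document-topic weight matrix from any TM technique, maps each document to its dominant topic, and scores the resulting partition against the ground-truth classes with scikit-learn's NMI and ARI; the dominant-topic mapping is one plausible protocol, not necessarily the one used in the paper.

    # Hedged sketch: evaluating a topic model with clustering metrics.
    # `doc_topic` is a (documents x topics) weight matrix and `labels` are the
    # ground-truth class labels; both names are assumptions for this sketch.
    import numpy as np
    from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score

    def clustering_scores(doc_topic, labels):
        assignments = np.asarray(doc_topic).argmax(axis=1)  # dominant topic per doc
        return {
            "NMI": normalized_mutual_info_score(labels, assignments),
            "ARI": adjusted_rand_score(labels, assignments),
        }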
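Finally, the MAUT consolidation can be sketched as a weighted sum of normalized per-criterion utilities, which yields one value per technique where values closer to 1 indicate stronger agreement among the criteria. The min-max normalization, the uniform default weights, and the function name maut_scores are assumptions; the paper's exact utility functions and weights may differ.

    # Hedged sketch of MAUT-style consolidation: each metric is min-max
    # normalized across the competing techniques into a [0, 1] utility, then
    # combined as a weighted sum (uniform weights by default, an assumption).
    import numpy as np

    def maut_scores(metric_matrix, weights=None):
        """metric_matrix: (techniques x metrics), higher = better for every metric."""
        m = np.asarray(metric_matrix, dtype=float)
        lo, hi = m.min(axis=0), m.max(axis=0)
        utilities = (m - lo) / np.where(hi > lo, hi - lo, 1.0)  # per-metric [0, 1]
        if weights is None:
            weights = np.full(m.shape[1], 1.0 / m.shape[1])
        return utilities @ weights  # one MAUT value per technique

    # e.g., rows = (LDA, NMF, CluWords), columns = (NPMI, NMI, ARI):
    # maut_scores([[0.10, 0.41, 0.35], [0.12, 0.45, 0.40], [0.15, 0.52, 0.47]])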
Journal introduction:
Knowledge-Based Systems, an international and interdisciplinary journal in artificial intelligence, publishes original, innovative, and creative research results in the field. It focuses on systems based on knowledge-based and other artificial intelligence techniques. The journal aims to support human prediction and decision-making through data science and computational techniques, to provide balanced coverage of theory and practical studies, and to encourage the development and implementation of knowledge-based intelligence models, methods, systems, and software tools. Applications in business, government, education, engineering, and healthcare are emphasized.