Using PageRank for Characterizing Topic Quality in LDA

Proceedings of the 2018 ACM SIGIR International Conference on Theory of Information Retrieval. Pub Date: 2018-09-10. DOI: 10.1145/3234944.3234955

Sujatha Das Gollapalli, Xiaoli Li


Citations: 7

Abstract



Topic models based on Latent Dirichlet Allocation (LDA) are employed effectively in various information retrieval and data mining tasks. Despite their popularity and widespread application, the question of how to assess the quality of topics extracted by LDA models is still not completely resolved. While various measures have been proposed to quantify the thematic coherence and interpretability of a topic extracted by LDA, they do not address this problem sufficiently. We observe that existing quality measures select top topic words based on their topic-word co-occurrence without considering word co-occurrences within the same context. We incorporate precisely this information by constructing topic-specific graphs that capture the neighborhoods of words in an LDA-modeled corpus. Next, the PageRank algorithm is applied to these graphs to assign word importance scores based on centrality. We propose two measures of topic quality: (1) the Aggregate PageRank of Top-words of a topic and (2) the PageRank Centralization Index of a topic-specific word graph. Our experiments across three datasets show that, unlike existing quality measures, our proposed measures identify topics that are discriminative as well as interpretable and yield superior performance on both classification and intruder word identification tasks.
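The pipeline outlined in the abstract (topic-specific word graph, PageRank, two graph-level scores) can be illustrated with a minimal sketch. The abstract does not give the exact graph construction or the formulas, so everything specific here is an assumption made for illustration: edges come from a sliding co-occurrence window over tokens already assigned to one topic, edges are undirected and weighted by count, the damping factor is 0.85, and the centralization index is a Freeman-style sum of gaps to the maximum score divided by n - 1. All function names are hypothetical and this is not presented as the authors' implementation.

```python
from collections import defaultdict


def build_topic_graph(docs, window=2):
    """Build an undirected, weighted co-occurrence graph.

    `docs` is a list of token lists. Assumption: each list holds only the
    tokens that the LDA model assigned to the topic being scored.
    """
    graph = defaultdict(lambda: defaultdict(float))
    for tokens in docs:
        for i, w in enumerate(tokens):
            # Link w to the next `window` tokens in the same context.
            for v in tokens[i + 1 : i + 1 + window]:
                if v != w:
                    graph[w][v] += 1.0
                    graph[v][w] += 1.0
    return graph


def pagerank(graph, damping=0.85, iters=100, tol=1e-10):
    """Weighted PageRank by power iteration.

    Every node has at least one neighbour because the graph is built from
    edges, so no dangling-node correction is needed here.
    """
    nodes = list(graph)
    n = len(nodes)
    rank = {w: 1.0 / n for w in nodes}
    out_weight = {w: sum(graph[w].values()) for w in nodes}
    for _ in range(iters):
        new = {w: (1.0 - damping) / n for w in nodes}
        for w in nodes:
            for v, wt in graph[w].items():
                new[v] += damping * rank[w] * wt / out_weight[w]
        delta = sum(abs(new[w] - rank[w]) for w in nodes)
        rank = new
        if delta < tol:
            break
    return rank


def aggregate_pagerank(rank, top_words):
    """Sum the PageRank scores of a topic's top words (assumed form)."""
    return sum(rank.get(w, 0.0) for w in top_words)


def centralization_index(rank):
    """Freeman-style centralization over PageRank scores.

    Assumption: the sum of gaps to the maximum score is normalized by n - 1.
    """
    n = len(rank)
    if n < 2:
        return 0.0
    top = max(rank.values())
    return sum(top - r for r in rank.values()) / (n - 1)


# Toy usage: one hub word ("model") surrounded by three other words.
toy_docs = [["model", "topic"], ["model", "word"], ["model", "lda"]]
toy_rank = pagerank(build_topic_graph(toy_docs, window=1))
toy_aggregate = aggregate_pagerank(toy_rank, ["model", "topic"])
toy_centralization = centralization_index(toy_rank)
```

Under these assumptions, a topic whose word graph is dominated by a few central words gets a higher centralization score than one whose words are all equally connected, and the aggregate score rewards topics whose top words are also the central ones in context.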

