Using PageRank for Characterizing Topic Quality in LDA
Sujatha Das Gollapalli, Xiaoli Li
Proceedings of the 2018 ACM SIGIR International Conference on Theory of Information Retrieval, 2018-09-10
DOI: 10.1145/3234944.3234955 (https://doi.org/10.1145/3234944.3234955)
Topic models based on Latent Dirichlet Allocation (LDA) are employed effectively in various information retrieval and data mining tasks. Despite their popularity and widespread application, the question of how to assess the quality of topics extracted by LDA models is still not completely resolved. While various measures have been proposed to quantify the thematic coherence and interpretability of a topic extracted by LDA, they do not address this problem sufficiently. We observe that existing quality measures select top topic words based on their topic-word co-occurrence without considering word co-occurrences within the same context. We incorporate precisely this information by constructing topic-specific graphs that capture the neighborhoods of words in an LDA-modeled corpus. Next, the PageRank algorithm is applied to these graphs to assign word importance scores based on centrality. We propose two measures to compute topic quality: (1) the Aggregate PageRank of Top-words of a topic and (2) the PageRank Centralization Index of a topic-specific word graph. Our experiments across three datasets show that, unlike existing quality measures, our proposed measures identify topics that are both discriminative and interpretable, and yield superior performance on both classification and intruder word identification tasks.
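The pipeline described above (word graph, PageRank, two topic-level scores) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the exact graph construction or formulas, so everything below is an assumption made for illustration. In particular, `docs` is assumed to hold only the tokens that LDA assigned to one topic, edges link words within a fixed window, the aggregate score is a plain sum of PageRank values over the top words, and the centralization index is a Freeman-style sum of gaps to the maximum score. All function names are hypothetical.

```python
from collections import defaultdict


def build_word_graph(docs, window=2):
    """Weighted undirected graph linking words that occur within
    `window` positions of each other (assumed neighborhood definition)."""
    graph = defaultdict(lambda: defaultdict(float))
    for tokens in docs:
        for i, w in enumerate(tokens):
            for v in tokens[i + 1 : i + 1 + window]:
                if v != w:
                    graph[w][v] += 1.0
                    graph[v][w] += 1.0
    return graph


def pagerank(graph, damping=0.85, iters=100, tol=1e-10):
    """Power-iteration PageRank. Every node has at least one neighbor
    because edges are stored in both directions, so there are no
    dangling nodes to handle."""
    nodes = list(graph)
    n = len(nodes)
    rank = {w: 1.0 / n for w in nodes}
    out_weight = {w: sum(graph[w].values()) for w in nodes}
    for _ in range(iters):
        new = {w: (1.0 - damping) / n for w in nodes}
        for w in nodes:
            for v, weight in graph[w].items():
                new[v] += damping * rank[w] * weight / out_weight[w]
        delta = sum(abs(new[w] - rank[w]) for w in nodes)
        rank = new
        if delta < tol:
            break
    return rank


def aggregate_pagerank(rank, top_words):
    """Assumed form of measure (1): sum of PageRank over the top words."""
    return sum(rank.get(w, 0.0) for w in top_words)


def centralization_index(rank):
    """Assumed form of measure (2): Freeman-style centralization,
    the total gap between the most central word and every word."""
    if not rank:
        return 0.0
    top = max(rank.values())
    return sum(top - r for r in rank.values())


# Toy topic: "topic" co-occurs with every other word (a star graph).
docs = [["lda", "topic"], ["topic", "model"], ["topic", "word"], ["topic", "graph"]]
scores = pagerank(build_word_graph(docs, window=1))
print(sorted(scores, key=scores.get, reverse=True)[:3])
print(aggregate_pagerank(scores, ["topic", "model"]))
print(centralization_index(scores))
```

In this toy graph the hub word receives the highest score, so a top-word list containing it has a larger aggregate than one made only of peripheral words, and the centralization index is positive because importance is concentrated on one node. In practice the top-word list would come from the fitted LDA model's topic-word distribution.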