CitationLDA++: an Extension of LDA for Discovering Topics in Document Network

T. Nguyen, P. Do
DOI: 10.1145/3287921.3287930
Journal: Proceedings of the 9th International Symposium on Information and Communication Technology
Publication date: 2018-12-06
Citations: 5

Abstract

With the rapid growth of electronic repositories of scientific publications, automatic topic identification from papers has become a valuable aid to researchers. The Latent Dirichlet Allocation (LDA) model is the most popular method for discovering hidden topics in texts, based on the co-occurrence of words in a corpus. LDA performs well on long documents; however, article repositories usually store only titles and abstracts, which are too short for LDA to work effectively. In this paper, we propose CitationLDA++, a model that improves LDA's ability to infer the topics of papers from their titles and/or abstracts together with citation information. The model is based on the assumption that the topics of the cited papers also reflect the topics of the citing paper. We divide the dataset into two sets: the first is used to build a prior knowledge source with the LDA algorithm, and the second is the training set for CitationLDA++. During inference with Gibbs sampling, CitationLDA++ uses the topic distributions of the prior knowledge source and the citation information to guide the assignment of topics to words in the text. Using the topics of cited papers helps overcome the scarcity of word co-occurrence in linked short texts. In experiments on the AMiner dataset, which includes the titles and/or abstracts of papers and their citation information, CitationLDA++ achieves better perplexity than LDA without the additional knowledge. The results suggest that citation information can improve LDA's ability to discover paper topics when the full text of papers is not available.
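The citation-guided Gibbs sampling described above can be illustrated with a minimal sketch. Note that the blending scheme below (mixing the standard collapsed-Gibbs probability with the averaged prior topic distribution of cited documents via a weight `lam`) is a hypothetical formulation for illustration; the abstract does not give the paper's exact sampling equation, and all names (`gibbs_sweep`, `prior_theta`, `lam`) are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_sweep(docs, z, n_dk, n_kw, n_k, prior_theta, citations,
                alpha=0.1, beta=0.01, lam=0.5):
    """One collapsed-Gibbs sweep over all words. For each word the
    standard LDA sampling probability is blended with the averaged
    prior topic distribution of the documents the current doc cites
    (hypothetical blending; the paper's exact rule may differ)."""
    K, V = n_kw.shape
    for d, doc in enumerate(docs):
        cited = citations.get(d, [])
        cite_prior = (np.mean([prior_theta[c] for c in cited], axis=0)
                      if cited else np.full(K, 1.0 / K))
        for i, w in enumerate(doc):
            k = z[d][i]
            # Remove the current assignment from the counts.
            n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
            # Standard collapsed-Gibbs term, then citation guidance.
            p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
            p = (1 - lam) * p / p.sum() + lam * cite_prior
            k = int(rng.choice(K, p=p / p.sum()))
            z[d][i] = k
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    return z

# Toy setup: 3 short "documents" over a 5-word vocabulary, K = 2 topics.
docs = [[0, 1, 2], [2, 3, 4], [0, 4]]
K, V = 2, 5
citations = {1: [0]}  # doc 1 cites doc 0
# Stand-in for the prior knowledge source built by a separate LDA run.
prior_theta = rng.dirichlet(np.ones(K), size=len(docs))
z = [[int(rng.integers(K)) for _ in doc] for doc in docs]
n_dk = np.zeros((len(docs), K)); n_kw = np.zeros((K, V)); n_k = np.zeros(K)
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        n_dk[d, z[d][i]] += 1; n_kw[z[d][i], w] += 1; n_k[z[d][i]] += 1

for _ in range(20):
    gibbs_sweep(docs, z, n_dk, n_kw, n_k, prior_theta, citations)
```

After each sweep the count matrices stay consistent with the corpus (row sums of `n_dk` equal document lengths), and documents with citations are nudged toward the topic mixtures of the papers they cite.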