A frequent keyword-set based algorithm for topic modeling and clustering of research papers

2011 3rd Conference on Data Mining and Optimization (DMO). Pub Date: 2011-06-28. DOI: 10.1109/DMO.2011.5976511

Kumar Shubankar, A. Singh, Vikram Pudi

Citations: 25

Abstract

In this paper we introduce a novel and efficient approach to detecting topics in a large corpus of research papers. With the rapidly growing size of the academic literature, topic detection has become a very challenging task. We present an approach that uses closed frequent keyword-sets to form topics. Our approach also provides a natural method for clustering the research papers into hierarchical, overlapping clusters, using the topic as the similarity measure. To rank the research papers within a topic cluster, we devise a modified PageRank algorithm that assigns an authoritative score to each research paper by considering the sub-graph in which the paper appears. We test our algorithms on the DBLP dataset and show experimentally that they are fast, effective and scalable.
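The abstract does not spell out the mining procedure, so the following is only a minimal sketch of the general idea of forming topics from closed frequent keyword-sets, not the authors' implementation. The toy corpus, the `MIN_SUPPORT` threshold, and all function names are assumptions made for illustration, and the brute-force enumeration stands in for whatever efficient miner the paper actually uses. A keyword-set is treated as closed when no proper superset occurs in the same number of papers; each closed set then labels a cluster made up of the papers that contain it.

```python
from itertools import combinations

# Hypothetical toy corpus: paper id -> keywords taken from its title.
papers = {
    "p1": {"data", "mining", "clustering"},
    "p2": {"data", "mining", "topic"},
    "p3": {"data", "mining", "clustering"},
    "p4": {"topic", "modeling"},
    "p5": {"data", "mining"},
}
MIN_SUPPORT = 2  # assumed threshold: a keyword-set must occur in >= 2 papers


def supporting_papers(keyword_set):
    """Return the ids of all papers whose keywords contain keyword_set."""
    return frozenset(pid for pid, kws in papers.items() if keyword_set <= kws)


def frequent_keyword_sets():
    """Enumerate frequent keyword-sets by brute force (toy scale only)."""
    vocabulary = sorted(set().union(*papers.values()))
    frequent = {}
    for size in range(1, len(vocabulary) + 1):
        found_any = False
        for combo in combinations(vocabulary, size):
            docs = supporting_papers(frozenset(combo))
            if len(docs) >= MIN_SUPPORT:
                frequent[frozenset(combo)] = docs
                found_any = True
        if not found_any:
            # No frequent set of this size means none of any larger size.
            break
    return frequent


def closed_only(frequent):
    """Keep only sets with no proper superset of equal support."""
    return {
        ks: docs
        for ks, docs in frequent.items()
        if not any(
            ks < other and len(docs) == len(other_docs)
            for other, other_docs in frequent.items()
        )
    }


# Each closed frequent keyword-set is a topic; its supporting papers
# form the cluster for that topic.
topics = closed_only(frequent_keyword_sets())
for keyword_set, cluster in sorted(topics.items(), key=lambda kv: len(kv[0])):
    print(sorted(keyword_set), "->", sorted(cluster))
```

In this toy run, paper `p2` lands in both the `{topic}` cluster and the `{data, mining}` cluster, which illustrates overlap, while the `{clustering, data, mining}` cluster is contained in the `{data, mining}` cluster, which illustrates the hierarchy the abstract refers to.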

