Towards a stratified learning approach to predict future citation counts

Proceedings of the ... ACM/IEEE Joint Conference on Digital Libraries. ACM/IEEE Joint Conference on Digital Libraries. Pages: 351-360. Pub Date: 2014-09-08. DOI: 10.1109/JCDL.2014.6970190

Tanmoy Chakraborty, Suhansanu Kumar, Pawan Goyal, Niloy Ganguly, Animesh Mukherjee


Citations: 85

Abstract

In this paper, we study the problem of predicting the future citation count of a scientific article a given time interval after its publication. To this end, we gather a dataset of more than 1.5 million scientific papers from the computer science domain and analyze it exhaustively. The analysis shows that the citation counts of articles over the years follow a diverse set of patterns; on closer inspection, we identify six broad categories of citation patterns. This observation motivates us to adopt a stratified learning approach to the prediction task, for which we propose a two-stage prediction model: in the first stage, the model maps a query paper into one of the six categories, and in the second stage, a regression module is run only on the subpopulation corresponding to that category to predict the future citation count of the query paper. Experimental results show that categorizing this large dataset during the training phase leads to a remarkable improvement (around 50%) over the well-known baseline system.
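The two-stage idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which classifier, regressor, or features the paper uses, so a nearest-centroid rule and a one-feature least-squares line stand in for the two stages. The single feature, the category names `slow` and `fast`, the class name `StratifiedPredictor`, and the synthetic numbers are all assumptions made for illustration, and only two categories are shown instead of six.

```python
from collections import defaultdict
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b on a single feature."""
    mx, my = mean(xs), mean(ys)
    var = sum((x - mx) ** 2 for x in xs)
    if var == 0:  # all x identical: fall back to predicting the mean target
        return 0.0, float(my)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var
    return a, my - a * mx


class StratifiedPredictor:
    """Hypothetical two-stage model: categorize first, then regress per category."""

    def fit(self, features, categories, targets):
        # Split the training set into one subpopulation per category.
        groups = defaultdict(list)
        for x, c, y in zip(features, categories, targets):
            groups[c].append((x, y))
        # Stage 1 stand-in: one centroid per category (nearest-centroid rule).
        self.centroids = {c: mean(x for x, _ in rows) for c, rows in groups.items()}
        # Stage 2 stand-in: one regression line fitted on each subpopulation only.
        self.lines = {
            c: fit_line([x for x, _ in rows], [y for _, y in rows])
            for c, rows in groups.items()
        }
        return self

    def classify(self, x):
        # Map a query paper to the category with the closest centroid.
        return min(self.centroids, key=lambda c: abs(x - self.centroids[c]))

    def predict(self, x):
        # Route the query to its category, then apply that category's regressor.
        category = self.classify(x)
        a, b = self.lines[category]
        return category, a * x + b


# Synthetic toy data (assumption): x is some early-stage feature of a paper,
# y is its later citation count, and each paper has a known pattern label.
features = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
categories = ["slow", "slow", "slow", "fast", "fast", "fast"]
targets = [3.0, 5.0, 7.0, 100.0, 110.0, 120.0]

model = StratifiedPredictor().fit(features, categories, targets)
print(model.predict(4.0))
print(model.predict(9.0))
```

Fitting a separate regressor per category lets each subpopulation keep its own trend; a single line fitted to all six toy points would blend the two slopes and fit neither group well.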

