Predicting citation impact of research papers using GPT and other text embeddings

arXiv - CS - Digital Libraries. Pub Date: 2024-07-29. DOI: arxiv-2407.19942

Adilson Vital Jr., Filipi N. Silva, Osvaldo N. Oliveira Jr., Diego R. Amancio


Abstract

The impact of research papers, typically measured in terms of citation counts, depends on several factors, including the reputation of the authors, journals, and institutions, in addition to the quality of the scientific work. In this paper, we present an approach that combines natural language processing and machine learning to predict the impact of papers in a specific journal. Our focus is on the text, which should correlate with impact and the topics covered in the research. We employed a dataset of over 40,000 articles from ACS Applied Materials and Interfaces spanning from 2012 to 2022. The data was processed using various text embedding techniques and classified with supervised machine learning algorithms. Papers were categorized into the top 20% most cited within the journal, using both yearly and cumulative citation counts as metrics. Our analysis reveals that the method employing generative pre-trained transformers (GPT) was the most efficient for embedding, while the random forest algorithm exhibited the best predictive power among the machine learning algorithms. An optimized accuracy of 80% in predicting whether a paper was among the top 20% most cited was achieved for the cumulative citation count when abstracts were processed. This accuracy is noteworthy, considering that author, institution, and early citation pattern information were not taken into account. The accuracy increased only slightly when the full texts of the papers were processed. Also significant is the finding that a simpler embedding technique, term frequency-inverse document frequency (TFIDF), yielded performance close to that of GPT. Since TFIDF captures the topics of the paper, we infer that, apart from considering author and institution biases, citation counts for the considered journal may be predicted by identifying topics and "reading" the abstract of a paper.
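As a rough illustration of two steps the abstract describes, labeling papers as top 20% most cited and turning abstracts into TFIDF vectors, the following is a minimal sketch in plain Python. It is not the authors' code: the function names, the whitespace tokenizer, the smoothed IDF formula, the tie handling, and the toy abstracts and citation counts are all assumptions made for illustration. The classification step is omitted; in practice the resulting vectors and labels would be passed to a supervised classifier such as a random forest.

```python
import math
from collections import Counter


def label_top_fraction(citations, fraction=0.2):
    """Return 1 for papers in the top `fraction` by citation count, else 0.

    Assumption: ties are broken by original order, and at least one
    paper is always labeled as top.
    """
    n_top = max(1, round(len(citations) * fraction))
    # Indices sorted from most cited to least cited (stable sort).
    ranked = sorted(range(len(citations)), key=lambda i: -citations[i])
    top = set(ranked[:n_top])
    return [1 if i in top else 0 for i in range(len(citations))]


def tfidf_vectors(docs):
    """Compute a sparse TFIDF vector (dict of term -> weight) per document.

    Assumption: lowercase whitespace tokenization and a smoothed IDF,
    idf(t) = ln((1 + N) / (1 + df(t))) + 1, with no length normalization.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


# Hypothetical toy data, not taken from the paper's dataset.
abstracts = [
    "perovskite solar cells with high efficiency",
    "graphene oxide membranes for water filtration",
    "flexible sensors based on graphene",
    "perovskite stability under humidity",
    "antibacterial coatings for implants",
]
citations = [120, 15, 40, 60, 8]

labels = label_top_fraction(citations)
vectors = tfidf_vectors(abstracts)

for label, vec in zip(labels, vectors):
    # Show the label and the three highest-weighted terms per abstract.
    top_terms = sorted(vec, key=vec.get, reverse=True)[:3]
    print(label, top_terms)
```

In this sketch, terms that occur in fewer abstracts receive larger weights, which is the sense in which TFIDF can be said to capture the topics of a paper.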
