Comparison of text-based and linked-based metrics in terms of estimating the similarity of articles

IF 1.4 · Zone 4, Management · Q2 INFORMATION SCIENCE & LIBRARY SCIENCE
M. Goltaji, Javad Abbaspour, A. Jowkar, S. M. Fakhrahmad
Journal of Librarianship and Information Science · Journal Article · Published 2023-04-29
DOI: 10.1177/09610006231165759 (https://doi.org/10.1177/09610006231165759)
Citations: 0

Abstract

This study assesses how well text-based metrics (Cosine and Lucene similarity), linked-based metrics (co-citation, bibliographic coupling, Amsler, PageRank, and HITS), and their combination estimate the similarity between articles. The experiments were conducted on a test collection of 26,262 articles from the PubMed Central Open Access Subset (PMC OAS) of CITREC, namely those articles that had linked-based metrics and whose full text was available for calculating text-based metrics. Thirty articles were selected as primary articles, and the articles related to each of them were retrieved using the MeSH similarity metric. The similarity of the retrieved documents was then also extracted according to the text-based and linked-based metrics. Next, the text-based, linked-based, and hybrid metrics were entered into a generalized regression model that estimates article similarity, in order to determine the power of each metric; finally, the performance of the models was compared using mean squared error and correlation. Among the text-based metrics, both Cosine and Lucene similarity were included in the model. Among the linked-based metrics, HITS (Hub), HITS (Authority), PageRank, and co-citation had the highest power, in that order, whereas bibliographic coupling and Amsler did not enter the model. Overall, a comparison of the text-based, linked-based, and hybrid models indicated that the linked-based model estimates the similarity between articles better than the text-based model, and that combining text-based and linked-based metrics yields little improvement in estimation power. Despite the importance and widespread use of text-based and linked-based metrics for measuring article similarity, no previous study was found that examines the power of these metrics individually and in comparison with one another for estimating article similarity.
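To make the metric families named in the abstract concrete, here is a minimal Python sketch of one text-based metric (Cosine similarity over raw term counts) and three linked-based metrics (bibliographic coupling, co-citation, and Amsler). It is an illustration under stated assumptions, not the study's implementation: the whitespace tokenizer, the toy citation graph `CITES`, the helper names, and the normalized set-overlap form of Amsler are all choices made here for the example. The abstract does not say how CITREC or the authors computed these scores.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between raw term-frequency vectors.

    Assumption: lowercase whitespace tokenization, no stemming or IDF weighting.
    """
    tf_a = Counter(text_a.lower().split())
    tf_b = Counter(text_b.lower().split())
    dot = sum(tf_a[term] * tf_b[term] for term in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def references(cites, doc):
    """Documents that `doc` cites (its reference list)."""
    return cites.get(doc, set())


def citing_docs(cites, doc):
    """Documents whose reference list contains `doc`."""
    return {src for src, refs in cites.items() if doc in refs}


def bibliographic_coupling(cites, a, b):
    """Number of references shared by a and b."""
    return len(references(cites, a) & references(cites, b))


def co_citation(cites, a, b):
    """Number of documents that cite both a and b."""
    return len(citing_docs(cites, a) & citing_docs(cites, b))


def amsler(cites, a, b):
    """Overlap of citation neighbourhoods (references plus citing documents).

    Assumption: normalized as intersection over union; other variants exist.
    """
    near_a = references(cites, a) | citing_docs(cites, a)
    near_b = references(cites, b) | citing_docs(cites, b)
    union = near_a | near_b
    return len(near_a & near_b) / len(union) if union else 0.0


# Hypothetical toy citation graph: each key cites the documents in its set.
CITES = {
    "A": {"C", "D"},
    "B": {"C", "E"},
    "F": {"A", "B"},
    "G": {"A", "B", "C"},
}

if __name__ == "__main__":
    print(cosine_similarity("gene expression in cancer", "gene expression in yeast"))
    print(bibliographic_coupling(CITES, "A", "B"))
    print(co_citation(CITES, "A", "B"))
    print(amsler(CITES, "A", "B"))
```

In this toy graph, A and B share one reference (C) and are cited together by two documents (F and G), so the two counts differ even though both measure citation overlap. Amsler, in the form assumed here, merges both directions into a single neighbourhood overlap.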
Source journal
Journal of Librarianship and Information Science (INFORMATION SCIENCE & LIBRARY SCIENCE)
CiteScore: 4.70
Self-citation rate: 11.80%
Articles published: 82
About the journal: Journal of Librarianship and Information Science is the peer-reviewed international quarterly journal for librarians, information scientists, specialists, managers and educators interested in keeping up to date with the most recent issues and developments in the field. The Journal provides a forum for the publication of research and practical developments as well as for discussion papers and viewpoints on topical concerns in a profession facing many challenges.