Why are these publications missing? Uncovering the reasons behind the exclusion of documents in free-access scholarly databases

Impact factor 2.8; Zone 2 (Management); JCR Q2 (Computer Science, Information Systems)
Lorena Delgado-Quirós, Isidro F. Aguillo, Alberto Martín-Martín, Emilio Delgado López-Cózar, Enrique Orduña-Malea, José Luis Ortega
Journal: Journal of the Association for Information Science and Technology
DOI: 10.1002/asi.24839
Published: 2023-10-31
Publication type: Journal Article
URL: https://onlinelibrary.wiley.com/doi/10.1002/asi.24839
PDF: https://asistdl.onlinelibrary.wiley.com/doi/epdf/10.1002/asi.24839
Citations: 0

Abstract

This study analyses the coverage of seven free-access bibliographic databases: Crossref, Dimensions (non-subscription version), Google Scholar, Lens, Microsoft Academic, Scilit, and Semantic Scholar. The aim is to identify the reasons why scholarly documents may be excluded and how those exclusions influence coverage. To do this, 116k randomly selected bibliographic records from Crossref were used as a baseline, and each database was queried through API endpoints and web scraping. The results show that coverage differences are mainly caused by the way each service builds its database. Classic bibliographic databases ingest almost exactly the same content as Crossref (Lens and Scilit miss 0.1% and 0.2% of the records, respectively), whereas academic search engines show lower coverage (Google Scholar misses 9.8% of the records, Semantic Scholar 10%, and Microsoft Academic 12%). For the search engines, the gaps are mainly attributed to external factors, such as web accessibility and robot exclusion policies (39.2%–46%), and to internal requirements that exclude secondary content (6.5%–11.6%). Dimensions is the classic bibliographic database with the lowest coverage (7.6% of records missing); there, internal selection criteria, namely the indexing of full books instead of book chapters (65%) and the exclusion of secondary content (15%), are the main reasons for missing publications.
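The missing-record percentages quoted in the abstract are, in essence, the share of baseline records that a database fails to return. The following is a minimal sketch of that arithmetic, not the authors' pipeline: the function names `normalize_doi` and `missing_rates`, the choice of DOI as the matching key, and the toy DOIs are all assumptions made for illustration.

```python
from typing import Dict, Iterable


def normalize_doi(doi: str) -> str:
    """Lower-case a DOI and strip a resolver prefix so comparisons are case-insensitive."""
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi


def missing_rates(baseline: Iterable[str],
                  found: Dict[str, Iterable[str]]) -> Dict[str, float]:
    """Return, per database, the percentage of baseline DOIs that were not found."""
    base = {normalize_doi(d) for d in baseline}
    if not base:
        raise ValueError("baseline sample is empty")
    rates = {}
    for name, dois in found.items():
        # Only hits that belong to the baseline sample count towards coverage.
        hits = base & {normalize_doi(d) for d in dois}
        rates[name] = 100.0 * (len(base) - len(hits)) / len(base)
    return rates


# Hypothetical toy data: four baseline records and two made-up databases.
sample = ["10.1234/a", "10.1234/b", "10.1234/c", "10.1234/d"]
results = {
    "db_x": ["10.1234/A", "https://doi.org/10.1234/b", "10.1234/c"],
    "db_y": sample,
}
print(missing_rates(sample, results))  # {'db_x': 25.0, 'db_y': 0.0}
```

Matching on a normalized DOI is only one possible design choice; records queried by title or retrieved through scraping would need a fuzzier comparison than the exact set intersection used here.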

Source journal
CiteScore: 8.30
Self-citation rate: 8.60%
Articles published: 115
About the journal: The Journal of the Association for Information Science and Technology (JASIST) is a leading international forum for peer-reviewed research in information science. For more than half a century, JASIST has provided intellectual leadership by publishing original research that focuses on the production, discovery, recording, storage, representation, retrieval, presentation, manipulation, dissemination, use, and evaluation of information and on the tools and techniques associated with these processes. The Journal welcomes rigorous work of an empirical, experimental, ethnographic, conceptual, historical, socio-technical, policy-analytic, or critical-theoretical nature. JASIST also commissions in-depth review articles (“Advances in Information Science”) and reviews of print and other media.