Evaluating the Linguistic Coverage of OpenAlex: An Assessment of Metadata Accuracy and Completeness
Lucía Céspedes, Diego Kozlowski, Carolina Pradier, Maxime Holmberg Sainte-Marie, Natsumi Solange Shokida, Pierre Benz, Constance Poitras, Anton Boudreau Ninkov, Saeideh Ebrahimy, Philips Ayeni, Sarra Filali, Bing Li, Vincent Larivière
arXiv - CS - Digital Libraries, 2024-09-16. DOI: arxiv-2409.10633 (https://doi.org/arxiv-2409.10633)
Citations: 0
Abstract
Clarivate's Web of Science (WoS) and Elsevier's Scopus have for decades been
the main sources of bibliometric information. Although highly curated, these
closed, proprietary databases are largely biased towards English-language
publications, underestimating the use of other languages in research
dissemination. Launched in 2022, OpenAlex promised comprehensive, inclusive,
and open-source research information. Although OpenAlex is already in use by
scholars and research institutions, the quality of its metadata is currently
being assessed.
This paper contributes to this literature by assessing the completeness and
accuracy of its metadata related to language, through a comparison with WoS, as
well as an in-depth manual validation of a sample of 6,836 articles. Results
show that OpenAlex exhibits a far more balanced linguistic coverage than WoS.
However, language metadata is not always accurate, which leads OpenAlex to
overestimate the place of English while underestimating that of other
languages. If used critically, OpenAlex can provide comprehensive and
representative analyses of languages used for scholarly publishing. However,
more work is needed at the infrastructural level to ensure the quality of
metadata on language.
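
The effect described in the abstract, where mislabelled records inflate the apparent share of English, can be illustrated with a minimal sketch. This is not the authors' method or code: the function names, the language codes, and the ten-record sample are all invented for illustration, and the sketch only assumes that each record carries a language from the metadata and a language assigned by manual validation.

```python
from collections import Counter


def language_shares(labels):
    """Return each language's share of the records as a fraction of the total."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {lang: n / total for lang, n in counts.items()}


def metadata_accuracy(pairs):
    """Fraction of records whose metadata language matches the validated one."""
    return sum(1 for meta, truth in pairs if meta == truth) / len(pairs)


# Hypothetical toy sample, NOT data from the paper:
# (language recorded in metadata, language found by manual validation)
sample = [
    ("en", "en"), ("en", "en"), ("en", "en"), ("en", "es"), ("en", "pt"),
    ("es", "es"), ("es", "es"), ("pt", "pt"), ("fr", "fr"), ("en", "fr"),
]

reported = language_shares(meta for meta, _ in sample)
validated = language_shares(truth for _, truth in sample)
accuracy = metadata_accuracy(sample)

# Compare the share each language has in the metadata with its validated share.
for lang in sorted(validated):
    print(lang, reported.get(lang, 0.0), validated[lang])
print("accuracy", accuracy)
```

In this toy sample every mislabelled record is tagged as English, so the metadata share of English exceeds its validated share while the other languages are undercounted, which is the direction of bias the abstract reports.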