Wikipedia Citations: Reproducible Citation Extraction from Multilingual Wikipedia
Natallia Kokash, Giovanni Colavizza
arXiv - CS - Digital Libraries, 2024-06-27 (arXiv:2406.19291)
Abstract
Wikipedia is an essential component of the open science ecosystem, yet it is
poorly integrated with academic open science initiatives. Wikipedia Citations
is a project that focuses on extracting and releasing comprehensive datasets of
citations from Wikipedia. A total of 29.3 million citations were extracted from
English Wikipedia in May 2020. Following this one-off research project, we
designed a reproducible pipeline that can process any given Wikipedia dump in
a cloud-based setting. To demonstrate its usability, we extracted 40.6
million citations in February 2023 and 44.7 million citations in February 2024.
Furthermore, we equipped the pipeline with an adapted Wikipedia citation
template translation module to process multilingual Wikipedia articles in 15
European languages so that they are parsed and mapped into a generic structured
citation template. This paper presents our open-source software pipeline to
retrieve, classify, and disambiguate citations on demand from a given Wikipedia
dump.
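
The template translation step described above can be illustrated with a minimal sketch. It is not the project's actual implementation: the function names, the `PARAM_MAP` table, and the localized parameter names in it are assumptions chosen for illustration only. The sketch parses a flat `{{...}}` citation template from wikitext and renames language-specific parameters to generic keys.

```python
import re

# Hypothetical mapping from localized parameter names to generic keys.
# The entries are illustrative assumptions, not the project's real tables.
PARAM_MAP = {
    "it": {"titolo": "title", "autore": "author", "rivista": "journal", "anno": "year"},
    "de": {"titel": "title", "autor": "author", "sammelwerk": "journal", "jahr": "year"},
}

# Matches a flat template with no nested braces inside it.
TEMPLATE_RE = re.compile(r"\{\{([^{}]*)\}\}", re.DOTALL)


def parse_template(wikitext):
    """Return (template_name, params) for the first flat template, or None."""
    match = TEMPLATE_RE.search(wikitext)
    if match is None:
        return None
    parts = [part.strip() for part in match.group(1).split("|")]
    name = parts[0]
    params = {}
    for part in parts[1:]:
        key, sep, value = part.partition("=")
        if sep:  # keep named parameters only, skip positional ones
            params[key.strip().lower()] = value.strip()
    return name, params


def to_generic(params, lang):
    """Rename localized parameter names; unknown names pass through unchanged."""
    mapping = PARAM_MAP.get(lang, {})
    return {mapping.get(key, key): value for key, value in params.items()}


def extract_citation(wikitext, lang):
    """Parse the first citation template and map it to generic keys."""
    parsed = parse_template(wikitext)
    if parsed is None:
        return None
    name, params = parsed
    return {"template": name, "lang": lang, **to_generic(params, lang)}


sample = "{{Cita pubblicazione |titolo=Esempio |autore=Rossi |anno=2020}}"
print(extract_citation(sample, "it"))
```

Splitting on `|` is a deliberate simplification: it breaks on piped wikilinks such as `[[A|B]]` and on nested templates, so a real pipeline would need a proper wikitext parser for this step.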