Evaluating link-based recommendations for Wikipedia

2016 IEEE/ACM Joint Conference on Digital Libraries (JCDL). Pub Date: 2016-06-19. DOI: 10.1145/2910896.2910908

M. Schwarzer, M. Schubotz, Norman Meuschke, Corinna Breitinger, V. Markl, Bela Gipp


Citations: 37

Abstract

Literature recommender systems support users in filtering the vast and increasing number of documents in digital libraries and on the Web. For academic literature, research has proven that citation-based document similarity measures, such as Co-Citation (CoCit) or Co-Citation Proximity Analysis (CPA), can improve recommendation quality. In this paper, we report on the first large-scale investigation of the performance of the CPA approach in generating literature recommendations for Wikipedia, which is fundamentally different from the academic literature domain. We analyze links instead of citations to generate article recommendations. We evaluate CPA, CoCit, and the Apache Lucene MoreLikeThis (MLT) function, which represents a traditional text-based similarity measure. We use two datasets of 779,716 and 2.57 million Wikipedia articles, the Big Data processing framework Apache Flink, and a ten-node computing cluster. To enable our large-scale evaluation, we derive two quasi-gold standards from the links in Wikipedia's “See also” sections and a comprehensive Wikipedia clickstream dataset. Our results show that the citation-based measures CPA and CoCit have complementary strengths compared to the text-based MLT measure. While MLT performs well in identifying narrowly similar articles that share similar words and structure, the citation-based measures are better able to identify topically related information, such as information on the city of a certain university or other technical universities in the region. The CPA approach, which consistently outperformed CoCit, is better suited for identifying a broader spectrum of related articles, as well as popular articles, which typically exhibit higher quality. Additional benefits of the CPA approach are its lower runtime requirements and its language independence, which allows cross-language retrieval of articles. We present a manual analysis of exemplary articles to demonstrate and discuss our findings.
The raw data and source code of our study, together with a manual on how to use them, are openly available at: https://github.com/wikimedia/citolytics.
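
The difference between the two link-based measures named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation (their code is in the repository linked above). The abstract does not state the exact CPA weighting, so the inverse-distance weight, the `alpha` exponent, the word-position input format, and the function names `link_scores` and `recommend` are all assumptions made for illustration, and the toy data is invented.

```python
from collections import defaultdict
from itertools import combinations


def link_scores(articles, alpha=1.0):
    """Return (cocit, cpa) score dicts keyed by sorted pairs of link targets.

    articles: dict mapping a source article title to a list of
              (target_title, word_position) tuples, one per outgoing link.
    alpha:    hypothetical decay exponent for the proximity weight
              (an assumption, not a value taken from the paper).
    """
    cocit = defaultdict(int)
    cpa = defaultdict(float)
    for links in articles.values():
        # Collect every position at which each target is linked.
        positions = defaultdict(list)
        for target, pos in links:
            positions[target].append(pos)
        for a, b in combinations(sorted(positions), 2):
            # CoCit: each source that links to both targets counts once.
            cocit[(a, b)] += 1
            # CPA (assumed form): weight the co-link by the inverse of the
            # smallest word distance between the two links.
            dist = min(abs(pa - pb) for pa in positions[a] for pb in positions[b])
            cpa[(a, b)] += max(dist, 1) ** -alpha
    return dict(cocit), dict(cpa)


def recommend(scores, article, k=3):
    """Rank the articles co-linked with `article` by descending score."""
    ranked = []
    for (a, b), score in scores.items():
        if article == a:
            ranked.append((b, score))
        elif article == b:
            ranked.append((a, score))
    # Break ties alphabetically so the ranking is deterministic.
    ranked.sort(key=lambda item: (-item[1], item[0]))
    return [title for title, _ in ranked[:k]]


# Invented toy data: three source articles with link targets and positions.
toy = {
    "Source 1": [("Berlin", 10), ("TU Berlin", 12), ("Germany", 200)],
    "Source 2": [("Berlin", 5), ("Germany", 8)],
    "Source 3": [("TU Berlin", 40), ("Berlin", 45), ("Germany", 300)],
}
cocit, cpa = link_scores(toy)
print(recommend(cocit, "Berlin"))  # ['Germany', 'TU Berlin']
print(recommend(cpa, "Berlin"))    # ['TU Berlin', 'Germany']
```

In this toy example, CoCit ranks "Germany" first because it is co-linked with "Berlin" in more source articles, while the proximity weighting ranks "TU Berlin" first because its links appear much closer to the "Berlin" links.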



