Title: De-duplicating a large crowd-sourced catalogue of bibliographic records
Authors: Ilija Subasic, N. Gvozdenovic, Kris Jack
DOI: 10.1108/PROG-02-2015-0021 (https://doi.org/10.1108/PROG-02-2015-0021)
Journal: Program-Electronic Library and Information Systems, Vol. 50 No. 1, pp. 138-156
Published: 2016-03-21 (Journal Article)
Citations: 0
Abstract
Purpose – The purpose of this paper is to describe a large-scale algorithm for generating a catalogue of scientific publication records (citations) from crowd-sourced data, to demonstrate how to learn an optimal combination of distance metrics for duplicate detection, and to introduce a parallel duplicate clustering algorithm. Design/methodology/approach – The authors developed the algorithm and compared it with state-of-the-art systems tackling the same problem. They used benchmark data sets (3k data points) to test the effectiveness of the algorithm and a real-life data set (> 90 million records) to test its efficiency and scalability. Findings – The authors show that duplicate detection can be improved by an additional step they call duplicate clustering. They also show how to improve the efficiency of a map/reduce similarity calculation algorithm by introducing a sampling step. Finally, the authors find that the system is comparable to state-of-the-art systems for duplicate detection, a...
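The abstract's core idea of combining several distance metrics into one duplicate-detection score can be sketched as follows. This is a minimal illustration, not the authors' system: the specific metrics (Jaccard on title tokens, normalized edit distance on author strings), the fixed weights, and the threshold are all illustrative assumptions; in the paper the combination is learned from data.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity between two token sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance via the classic row-by-row dynamic program."""
    prev = list(range(len(t) + 1))
    for i in range(1, len(s) + 1):
        curr = [i] + [0] * len(t)
        for j in range(1, len(t) + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost)
        prev = curr
    return prev[len(t)]

def combined_similarity(rec1: dict, rec2: dict, weights=(0.6, 0.4)) -> float:
    """Weighted combination of per-field similarities.

    The weights here are hypothetical; the paper learns an optimal
    combination rather than fixing it by hand.
    """
    title_sim = jaccard(set(rec1["title"].lower().split()),
                        set(rec2["title"].lower().split()))
    a1, a2 = rec1["authors"].lower(), rec2["authors"].lower()
    author_sim = 1.0 - edit_distance(a1, a2) / max(len(a1), len(a2), 1)
    return weights[0] * title_sim + weights[1] * author_sim

# Two near-duplicate citation records (toy data).
r1 = {"title": "De-duplicating a large catalogue", "authors": "I. Subasic"}
r2 = {"title": "Deduplicating a large catalogue", "authors": "Ilija Subasic"}
score = combined_similarity(r1, r2)
# A score above some learned threshold would flag the pair as duplicate candidates.
```

In a map/reduce setting, such pairwise scores would only be computed within candidate blocks (e.g. records sharing a title-token key), since scoring all pairs of 90 million records directly is infeasible.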
About the journal:
■ Automation of library and information services
■ Storage and retrieval of all forms of electronic information
■ Delivery of information to end users
■ Database design and management
■ Techniques for storing and distributing information
■ Networking and communications technology
■ The Internet
■ User interface design
■ Procurement of systems
■ User training and support
■ System evaluation