De-duplicating a large crowd-sourced catalogue of bibliographic records

Q Social Sciences
Ilija Subasic, N. Gvozdenovic, Kris Jack
Journal: Program-Electronic Library and Information Systems
Volume: 50 1
Pages: 138-156
Publication date: 2016-03-21
Publication type: Journal Article
DOI: 10.1108/PROG-02-2015-0021
URL: https://doi.org/10.1108/PROG-02-2015-0021
Open access: no
Citation count: 0

Abstract

Purpose – The purpose of this paper is to describe a large-scale algorithm for generating a catalogue of scientific publication records (citations) from crowd-sourced data, demonstrate how to learn an optimal combination of distance metrics for duplicate detection, and introduce a parallel duplicate clustering algorithm. Design/methodology/approach – The authors developed the algorithm and compared it with state-of-the-art systems tackling the same problem. They used benchmark data sets (3k data points) to test the effectiveness of the algorithm and a real-life data set (> 90 million) to test its efficiency and scalability. Findings – The authors show that duplicate detection can be improved by an additional step they call duplicate clustering. They also show how to improve the efficiency of a map/reduce similarity calculation algorithm by introducing a sampling step. Finally, the authors find that the system is comparable to the state-of-the-art systems for duplicate detection, a...
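The abstract is truncated and does not spell out the method, so the following is only a minimal sketch of the general idea it names: combine several distance metrics into one score, mark close pairs as duplicates, then group records transitively ("duplicate clustering"). Everything specific here is an assumption made for illustration: the two metrics, the fixed weights (the paper learns its combination), the threshold, and the sequential all-pairs loop with union-find (the paper's algorithm is parallel and uses map/reduce with a sampling step).

```python
from difflib import SequenceMatcher
from itertools import combinations


def title_distance(a, b):
    """1 minus the character-level similarity of the lower-cased titles."""
    return 1.0 - SequenceMatcher(None, a["title"].lower(), b["title"].lower()).ratio()


def year_distance(a, b):
    """0 if the publication years match, 1 otherwise."""
    return 0.0 if a["year"] == b["year"] else 1.0


# Hypothetical fixed weights and cut-off; the paper learns the combination from data.
WEIGHTS = ((title_distance, 0.8), (year_distance, 0.2))
THRESHOLD = 0.2


def combined_distance(a, b):
    """Weighted sum of the individual distance metrics."""
    return sum(weight * metric(a, b) for metric, weight in WEIGHTS)


def cluster_duplicates(records):
    """Group record indices into clusters of transitively linked duplicates."""
    parent = list(range(len(records)))

    def find(i):
        # Follow parent links to the cluster root, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Sequential all-pairs comparison: fine for a toy example, not for 90 million records.
    for i, j in combinations(range(len(records)), 2):
        if combined_distance(records[i], records[j]) < THRESHOLD:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(records)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


records = [
    {"title": "De-duplicating a large crowd-sourced catalogue of bibliographic records", "year": 2016},
    {"title": "De-duplicating a large crowd sourced catalogue of bibliographic records", "year": 2016},
    {"title": "A survey of map/reduce join algorithms", "year": 2012},
]
print(cluster_duplicates(records))  # [[0, 1], [2]]
```

The clustering step is what lets a record join a group even if it was only matched against one member of it; a pairwise detector alone would leave such chains unmerged.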
Source journal
Program-Electronic Library and Information Systems
Category: Engineering and Technology - Computer Science: Information Systems
CiteScore: 1.30
Self-citation rate: 0.00%
Articles published: 0
Review time: >12 weeks
Journal scope:
■ Automation of library and information services
■ Storage and retrieval of all forms of electronic information
■ Delivery of information to end users
■ Database design and management
■ Techniques for storing and distributing information
■ Networking and communications technology
■ The Internet
■ User interface design
■ Procurement of systems
■ User training and support
■ System evaluation