Using Hashing to Improve Efficiency in Cross-image Duplicate Detection in Research Publications

2021 IEEE Integrated STEM Education Conference (ISEC). Pub Date: 2021-03-13. DOI: 10.1109/ISEC52395.2021.9763956

Tong-suo Lu


Abstract

Cases of research misconduct have increasingly come to light through the duplicate figures they contain; Bik et al. [1] examined over 20,000 published biomedical papers and found that 3.8% had inappropriate duplicate figures, with this percentage on the rise in recent years. Currently, figure duplicates are identified mainly by human reviewers; the process is slow and requires specialized training. There have been attempts to develop large-scale screening tools for image duplicates, but they are either unpublished [2] or do not perform well. Prior research exists in the field of copy-move forgery detection. These methods deal with duplicate regions within a single image, but they could be modified and applied to cross-image matching, as we intend to do. However, cross-image matching involves a much larger set of features to match against, and feature matching is currently the slowest step in the process [3]. There are currently two directions for addressing this problem. One is to use keypoint-based features, such as SIFT, to decrease the size of the feature set. The other is to apply hashing to the features and use hash lookup to quickly eliminate features that cannot match; Bayram et al. [4] demonstrate that using Bloom filters in place of traditional methods increases matching speed at some loss of accuracy. We plan to devise a method that applies hashing to the matching of SIFT features in order to reliably outperform prior methods in speed on cross-image matching in large biomedical image sets. We expect the resulting method to run faster than current methods with little to no loss of accuracy.
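
The hash-lookup direction mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the method proposed here, nor the one in Bayram et al. [4]. It assumes each feature descriptor is a short list of floats, and the names `quantize`, `BloomFilter`, and `candidate_queries`, the bin width `step`, and the filter size are all hypothetical choices made for illustration. The idea shown is only the filtering step: descriptors from one image are coarsely quantized and inserted into a Bloom filter, and descriptors from another image are discarded early when the filter reports that their quantized key was never inserted.

```python
import hashlib
from typing import Iterable, List, Sequence, Tuple

Key = Tuple[int, ...]


def quantize(descriptor: Sequence[float], step: float = 0.25) -> Key:
    """Coarsely bin each component so near-identical descriptors share a key.

    The bin width `step` is an illustrative assumption, not a tuned value.
    """
    return tuple(int(v // step) for v in descriptor)


class BloomFilter:
    """Small Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, n_bits: int = 1 << 16, n_hashes: int = 3) -> None:
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits // 8 + 1)

    def _positions(self, key: Key) -> Iterable[int]:
        # Derive n_hashes bit positions by salting one hash function.
        data = repr(key).encode("utf-8")
        for seed in range(self.n_hashes):
            digest = hashlib.blake2b(
                data, digest_size=8, salt=bytes([seed])
            ).digest()
            yield int.from_bytes(digest, "big") % self.n_bits

    def add(self, key: Key) -> None:
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key: Key) -> bool:
        return all(
            self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key)
        )


def candidate_queries(
    indexed: Sequence[Sequence[float]],
    queries: Sequence[Sequence[float]],
) -> List[int]:
    """Return indices of query descriptors that survive the hash filter."""
    bloom = BloomFilter()
    for descriptor in indexed:
        bloom.add(quantize(descriptor))
    return [
        i for i, descriptor in enumerate(queries)
        if bloom.might_contain(quantize(descriptor))
    ]


if __name__ == "__main__":
    image_a = [[0.10, 0.90, 0.30], [0.60, 0.20, 0.80]]
    image_b = [[0.12, 0.88, 0.31], [0.95, 0.95, 0.05]]
    # Only the surviving indices would go on to a full distance comparison.
    print(candidate_queries(image_a, image_b))
```

In this toy setup, only query descriptors that pass the filter would proceed to the expensive pairwise distance comparison. Two descriptors that are close but fall on opposite sides of a bin boundary receive different keys and are wrongly discarded, which is one way a hashing scheme of this kind can trade accuracy for speed.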
