Using Hashing to Improve Efficiency in Cross-image Duplicate Detection in Research Publications
Tong-suo Lu
2021 IEEE Integrated STEM Education Conference (ISEC), published 2021-03-13
DOI: 10.1109/ISEC52395.2021.9763956 (https://doi.org/10.1109/ISEC52395.2021.9763956)
Citation count: 0
Abstract
Research misconduct is increasingly being exposed through duplicated figures in publications: Bik et al. [1] examined over 20 thousand published biomedical papers and found that 3.8% contained inappropriately duplicated figures, with this percentage rising in recent years. Currently, figure duplicates are identified mainly by human reviewers; the process is slow and requires specialized training. There have been attempts to develop large-scale screening tools for image duplicates, but they are either unpublished [2] or do not perform very well. Prior research exists in the field of copy-move forgery detection. These methods deal with duplicated regions within a single image, but they could be modified and applied to cross-image matching, as we intend to do. However, cross-image matching implies a much larger feature set to match against, and feature matching is currently the slowest step in the process [3]. There are two directions for addressing this problem. One is to use keypoint-based features, such as SIFT, to decrease the size of the feature set. The other is to hash the features and use hash lookup to quickly eliminate features that definitely do not match; Bayram et al. [4] demonstrate that using Bloom filters in place of traditional methods increased matching speed at some loss of accuracy. We plan to devise a method that applies hashing to the matching of SIFT features in order to reliably outperform prior methods in speed on cross-image matching in large biomedical image sets. We expect the resulting method to run faster than current methods with little to no loss of accuracy.
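The hash-lookup idea can be illustrated with a minimal sketch. This is not the method proposed in the abstract, nor the implementation of Bayram et al. [4]; it only shows how a Bloom filter might discard descriptors that cannot match before any expensive comparison runs. The names `BloomFilter`, `quantize`, and `candidate_matches`, the quantization step of 32, and the short 4-value descriptors (real SIFT descriptors have 128 values) are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: never a false negative, occasionally a false positive."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key):
        # Derive num_hashes bit positions from salted BLAKE2b digests of the key.
        for i in range(self.num_hashes):
            digest = hashlib.blake2b(key, digest_size=8, salt=bytes([i])).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


def quantize(descriptor, step=32):
    """Coarsely quantize a descriptor so near-identical ones share a key.

    Assumed scheme: integer division of each 0..255 component by `step`.
    """
    return bytes(min(255, int(v) // step) for v in descriptor)


def candidate_matches(indexed, queries, step=32):
    """Return the query descriptors that survive the Bloom-filter prefilter."""
    bloom = BloomFilter()
    for descriptor in indexed:
        bloom.add(quantize(descriptor, step))
    return [q for q in queries if quantize(q, step) in bloom]


# Toy descriptors standing in for SIFT features from two different figures.
indexed = [(10, 200, 35, 90)]
queries = [(12, 201, 33, 91), (250, 5, 130, 60)]
survivors = candidate_matches(indexed, queries)
```

Only the surviving candidates would then go through full descriptor comparison. The sketch also hints at where accuracy can be lost: two nearly identical descriptors that fall on opposite sides of a quantization boundary get different keys and are discarded, so the choice of hashing scheme governs the speed/accuracy trade-off.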