Double Trouble? Impact and Detection of Duplicates in Face Image Datasets

International Conference on Pattern Recognition Applications and Methods. Pub Date: 2024-01-25. DOI: 10.5220/0012422500003654

Torsten Schlett, C. Rathgeb, Juan E. Tapia, Christoph Busch

Pages: 801-808

Citations: 0

Abstract

Various face image datasets intended for facial biometrics research were created via web-scraping, i.e., by collecting images that are publicly available on the internet. This work presents an approach to detect both exactly and nearly identical face image duplicates, using file and image hashes. The approach is extended through the use of face image preprocessing. Additional steps based on face recognition and face image quality assessment models reduce false positives and facilitate the deduplication of the face images for both intra- and inter-subject duplicate sets. The presented approach is applied to five datasets, namely LFW, TinyFace, Adience, CASIA-WebFace, and C-MS-Celeb (a cleaned MS-Celeb-1M variant). Duplicates are detected within every dataset, with hundreds to hundreds of thousands of duplicates for all except LFW. Face recognition and quality assessment experiments indicate that removing the duplicates has only a minor impact on the results. The final deduplication data is publicly available.
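The two hash-based detection stages mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not name the specific hashes, so this example assumes SHA-256 as the file hash and a simple average hash as the image hash; the function names, the `max_distance` threshold, and the small grayscale grids are all hypothetical. The preprocessing, face recognition, and quality assessment steps are not covered here.

```python
from __future__ import annotations

import hashlib
from collections import defaultdict


def file_hash(data: bytes) -> str:
    """Exact-duplicate key: identical bytes give an identical digest."""
    return hashlib.sha256(data).hexdigest()


def average_hash(pixels: list[list[int]]) -> int:
    """Near-duplicate key (assumed stand-in for an image hash).

    `pixels` is assumed to be a small grayscale grid that was already
    downscaled from the image. Each pixel contributes one bit, set when
    the pixel is brighter than the grid's mean.
    """
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two image hashes."""
    return bin(a ^ b).count("1")


def exact_duplicate_sets(files: dict[str, bytes]) -> list[list[str]]:
    """Group file names whose contents share the same file hash."""
    groups = defaultdict(list)
    for name, data in files.items():
        groups[file_hash(data)].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


def near_duplicate_pairs(
    images: dict[str, list[list[int]]], max_distance: int = 2
) -> list[tuple[str, str]]:
    """Return name pairs whose image hashes differ by at most `max_distance` bits."""
    hashes = {name: average_hash(px) for name, px in images.items()}
    names = sorted(hashes)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if hamming_distance(hashes[a], hashes[b]) <= max_distance
    ]
```

In this sketch, byte-identical files fall into the same set via `exact_duplicate_sets`, while `near_duplicate_pairs` can also catch images that differ slightly, for example through a uniform brightness shift. The pairwise comparison is quadratic in the number of images, so it is only meant to show the idea rather than to scale to datasets of the size discussed above.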
