S-ViT: Sparse Vision Transformer for Accurate Face Recognition

IF 0.4 Q4 COMPUTER SCIENCE, INFORMATION SYSTEMS

Applied Computing Review Pub Date : 2023-03-27 DOI:10.1145/3555776.3577640

Geunsu Kim, Gyudo Park, Soohyeok Kang, Simon S. Woo


Citations: 0

Abstract

Most existing deep-learning face recognition applications use CNN-based architectures as the feature extractor. However, recent studies have shown that vision transformer-based models often outperform CNN-based models in computer vision tasks. In this work, we therefore propose the Sparse Vision Transformer (S-ViT), a model based on the Vision Transformer (ViT) architecture, to improve face recognition. After training, S-ViT tends to have a sparser weight distribution than ViT, which is the origin of its name. Unlike the conventional ViT, S-ViT adopts the image Relative Positional Encoding (iRPE) method for positional encoding. S-ViT is also modified so that all token embeddings, not just the class token, participate in the decoding process. Through extensive experiments, we show that S-ViT achieves better closed-set performance than the other baseline models and also outperforms the baseline ViT-based models. For example, with ArcFace as the loss function in the identification protocol, S-ViT achieved up to 3.27% higher accuracy than ResNet50. We also show that the ArcFace loss function yields greater performance gains in S-ViT than in the baseline models. In addition, S-ViT offers a better cost-performance trade-off because it tends to be more robust to pruning than its underlying model, ViT. S-ViT can therefore be deployed more flexibly on target devices with limited resources.
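To make two ideas from the abstract concrete, here is a minimal sketch in plain Python, not the authors' implementation. `arcface_logits` follows the standard ArcFace formulation, adding an angular margin to the true-class angle before scaling. `magnitude_prune` illustrates one common pruning scheme, magnitude-based pruning. The abstract does not say which pruning technique was used, so that choice is an assumption. The function names and the defaults `scale=64.0` and `margin=0.5` are illustrative and do not come from this abstract.

```python
import math


def l2_normalize(v):
    """Return v scaled to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def arcface_logits(embedding, class_weights, label, scale=64.0, margin=0.5):
    """ArcFace-style logits for one embedding (illustrative sketch).

    True class:    scale * cos(theta + margin)
    Other classes: scale * cos(theta)
    where theta is the angle between the normalized embedding and the
    normalized class weight vector. The scale and margin defaults are
    assumptions, not values reported in the abstract.
    """
    e = l2_normalize(embedding)
    logits = []
    for j, w in enumerate(class_weights):
        cos = sum(a * b for a, b in zip(e, l2_normalize(w)))
        cos = max(-1.0, min(1.0, cos))  # guard acos against rounding error
        theta = math.acos(cos)
        if j == label:
            theta += margin  # additive angular margin on the true class only
        logits.append(scale * math.cos(theta))
    return logits


def magnitude_prune(weights, fraction):
    """Zero the `fraction` of weights with the smallest absolute value.

    Assumed pruning scheme (magnitude-based); ties at the threshold are
    all pruned, so slightly more than `fraction` may be zeroed.
    """
    k = int(fraction * len(weights))
    if k == 0:
        return list(weights)
    threshold = sorted(abs(w) for w in weights)[k - 1]
    return [0.0 if abs(w) <= threshold else w for w in weights]


if __name__ == "__main__":
    classes = [[1.0, 0.0], [0.0, 1.0]]
    print(arcface_logits([1.0, 0.0], classes, label=0))
    print(magnitude_prune([0.1, -2.0, 0.05, 3.0], fraction=0.5))
```

Under this sketch, a model whose trained weights are already concentrated near zero loses little when the smallest-magnitude weights are removed, which is one plausible reading of why a sparse weight distribution would go along with robustness to pruning.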




Source journal

Applied Computing Review (COMPUTER SCIENCE, INFORMATION SYSTEMS)

Self-citation rate

40.00%

Articles published