S-ViT: Sparse Vision Transformer for Accurate Face Recognition

IF 0.4 Q4 COMPUTER SCIENCE, INFORMATION SYSTEMS
Geunsu Kim, Gyudo Park, Soohyeok Kang, Simon S. Woo
{"title":"S-ViT: Sparse Vision Transformer for Accurate Face Recognition","authors":"Geunsu Kim, Gyudo Park, Soohyeok Kang, Simon S. Woo","doi":"10.1145/3555776.3577640","DOIUrl":null,"url":null,"abstract":"Most of the existing face recognition applications using deep learning models have leveraged CNN-based architectures as the feature extractor. However, recent studies have shown that in computer vision tasks, vision transformer-based models often outperform CNN-based models. Therefore, in this work, we propose a Sparse Vision Transformer (S-ViT) based on the Vision Transformer (ViT) architecture to improve the face recognition tasks. After the model is trained, S-ViT tends to have a sparse distribution of weights compared to ViT, so we named it according to these characteristics. Unlike the conventional ViT, our proposed S-ViT adopts image Relative Positional Encoding (iRPE) method for positional encoding. Also, S-ViT has been modified so that all token embeddings, not just class token, participate in the decoding process. Through extensive experiment, we showed that S-ViT achieves better performance in closed-set than the other baseline models, and showed better performance than the baseline ViT-based models. For example, when using ArcFace as the loss function in the identification protocol, S-ViT achieved up to 3.27% higher accuracy than ResNet50. We also show that the use of ArcFace loss functions yields greater performance gains in S-ViT than in baseline models. In addition, S-ViT has an advantage in cost-performance trade-off because it tends to be more robust to the pruning technique than the underlying model, ViT. Therefore, S-ViT offers the additional advantage, which can be applied more flexibly in the target devices with limited resources.","PeriodicalId":42971,"journal":{"name":"Applied Computing Review","volume":"18 1","pages":""},"PeriodicalIF":0.4000,"publicationDate":"2023-03-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Applied Computing Review","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3555776.3577640","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
引用次数: 0

Abstract

Most of the existing face recognition applications using deep learning models have leveraged CNN-based architectures as the feature extractor. However, recent studies have shown that in computer vision tasks, vision transformer-based models often outperform CNN-based models. Therefore, in this work, we propose a Sparse Vision Transformer (S-ViT) based on the Vision Transformer (ViT) architecture to improve the face recognition tasks. After the model is trained, S-ViT tends to have a sparse distribution of weights compared to ViT, so we named it according to these characteristics. Unlike the conventional ViT, our proposed S-ViT adopts image Relative Positional Encoding (iRPE) method for positional encoding. Also, S-ViT has been modified so that all token embeddings, not just class token, participate in the decoding process. Through extensive experiment, we showed that S-ViT achieves better performance in closed-set than the other baseline models, and showed better performance than the baseline ViT-based models. For example, when using ArcFace as the loss function in the identification protocol, S-ViT achieved up to 3.27% higher accuracy than ResNet50. We also show that the use of ArcFace loss functions yields greater performance gains in S-ViT than in baseline models. In addition, S-ViT has an advantage in cost-performance trade-off because it tends to be more robust to the pruning technique than the underlying model, ViT. Therefore, S-ViT offers the additional advantage, which can be applied more flexibly in the target devices with limited resources.
S-ViT:用于精确人脸识别的稀疏视觉变换
大多数使用深度学习模型的现有人脸识别应用都利用基于cnn的架构作为特征提取器。然而,最近的研究表明,在计算机视觉任务中,基于视觉变换的模型往往优于基于cnn的模型。因此,在这项工作中,我们提出了一种基于视觉转换器(ViT)架构的稀疏视觉转换器(S-ViT)来改进人脸识别任务。经过模型训练后,S-ViT相对于ViT的权值分布趋于稀疏,所以我们根据这些特征来命名它。与传统的ViT不同,本文提出的S-ViT采用图像相对位置编码(iRPE)方法进行位置编码。此外,S-ViT已被修改,以便所有令牌嵌入,而不仅仅是类令牌,参与解码过程。通过大量的实验,我们发现S-ViT在闭集中的性能优于其他基线模型,并且优于基于基线vit的模型。例如,在识别协议中使用ArcFace作为损失函数时,S-ViT的准确率比ResNet50高出3.27%。我们还表明,使用ArcFace损失函数在S-ViT中比在基线模型中产生更大的性能收益。此外,S-ViT在成本-性能权衡方面具有优势,因为它比底层模型ViT对剪枝技术更健壮。因此,S-ViT提供了额外的优势,可以更灵活地应用于资源有限的目标设备。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
Applied Computing Review
Applied Computing Review COMPUTER SCIENCE, INFORMATION SYSTEMS-
自引率
40.00%
发文量
8
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信