用于图像分类的压缩域视觉变换器

IF 3.7 2区 工程技术 Q2 ENGINEERING, ELECTRICAL & ELECTRONIC
Ruolei Ji;Lina J. Karam
{"title":"用于图像分类的压缩域视觉变换器","authors":"Ruolei Ji;Lina J. Karam","doi":"10.1109/JETCAS.2024.3394878","DOIUrl":null,"url":null,"abstract":"Compressed-domain visual task schemes, where visual processing or computer vision are directly performed on the compressed-domain representations, were shown to achieve a higher computational efficiency during training and deployment by avoiding the need to decode the compressed visual information while resulting in a competitive or even better performance as compared to corresponding spatial-domain visual tasks. This work is concerned with learning-based compressed-domain image classification, where the image classification is performed directly on compressed-domain representations, also known as latent representations, that are obtained using a learning-based visual encoder. In this paper, a compressed-domain Vision Transformer (cViT) is proposed to perform image classification in the learning-based compressed-domain. For this purpose, the Vision Transformer (ViT) architecture is adopted and modified to perform classification directly in the compressed-domain. As part of this work, a novel feature patch embedding is introduced leveraging the within- and cross-channel information in the compressed-domain. Also, an adaptation training strategy is designed to adopt the weights from the pre-trained spatial-domain ViT and adapt these to the compressed-domain classification task. Furthermore, the pre-trained ViT weights are utilized through interpolation for position embedding initialization to further improve the performance of cViT. The experimental results show that the proposed cViT outperforms the existing compressed-domain classification networks in terms of Top-1 and Top-5 classification accuracies. Moreover, the proposed cViT can yield competitive classification accuracies with a significantly higher computational efficiency as compared to pixel-domain approaches.","PeriodicalId":48827,"journal":{"name":"IEEE Journal on Emerging and Selected Topics in Circuits and Systems","volume":null,"pages":null},"PeriodicalIF":3.7000,"publicationDate":"2024-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Compressed-Domain Vision Transformer for Image Classification\",\"authors\":\"Ruolei Ji;Lina J. Karam\",\"doi\":\"10.1109/JETCAS.2024.3394878\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Compressed-domain visual task schemes, where visual processing or computer vision are directly performed on the compressed-domain representations, were shown to achieve a higher computational efficiency during training and deployment by avoiding the need to decode the compressed visual information while resulting in a competitive or even better performance as compared to corresponding spatial-domain visual tasks. This work is concerned with learning-based compressed-domain image classification, where the image classification is performed directly on compressed-domain representations, also known as latent representations, that are obtained using a learning-based visual encoder. In this paper, a compressed-domain Vision Transformer (cViT) is proposed to perform image classification in the learning-based compressed-domain. For this purpose, the Vision Transformer (ViT) architecture is adopted and modified to perform classification directly in the compressed-domain. As part of this work, a novel feature patch embedding is introduced leveraging the within- and cross-channel information in the compressed-domain. Also, an adaptation training strategy is designed to adopt the weights from the pre-trained spatial-domain ViT and adapt these to the compressed-domain classification task. Furthermore, the pre-trained ViT weights are utilized through interpolation for position embedding initialization to further improve the performance of cViT. The experimental results show that the proposed cViT outperforms the existing compressed-domain classification networks in terms of Top-1 and Top-5 classification accuracies. Moreover, the proposed cViT can yield competitive classification accuracies with a significantly higher computational efficiency as compared to pixel-domain approaches.\",\"PeriodicalId\":48827,\"journal\":{\"name\":\"IEEE Journal on Emerging and Selected Topics in Circuits and Systems\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":3.7000,\"publicationDate\":\"2024-04-29\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Journal on Emerging and Selected Topics in Circuits and Systems\",\"FirstCategoryId\":\"5\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10510316/\",\"RegionNum\":2,\"RegionCategory\":\"工程技术\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"ENGINEERING, ELECTRICAL & ELECTRONIC\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Journal on Emerging and Selected Topics in Circuits and Systems","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10510316/","RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
引用次数: 0

摘要

在压缩域视觉任务方案中,视觉处理或计算机视觉直接在压缩域表征上执行,通过避免对压缩视觉信息进行解码,在训练和部署过程中实现了更高的计算效率,同时与相应的空间域视觉任务相比,具有竞争力甚至更好的性能。这项工作关注的是基于学习的压缩域图像分类,即直接在压缩域表征(也称为潜在表征)上执行图像分类,这些表征是使用基于学习的视觉编码器获得的。本文提出了一种压缩域视觉变换器(cViT),用于在基于学习的压缩域中执行图像分类。为此,本文采用并修改了视觉变换器(ViT)架构,以便直接在压缩域中执行分类。作为这项工作的一部分,我们引入了一种新颖的特征补丁嵌入方法,利用压缩域中的内部和跨通道信息。此外,还设计了一种适应性训练策略,采用预先训练好的空间域 ViT 的权重,并将其适应于压缩域分类任务。此外,预训练的 ViT 权重通过插值法用于位置嵌入初始化,以进一步提高 cViT 的性能。实验结果表明,所提出的 cViT 在分类精度 Top-1 和 Top-5 方面优于现有的压缩域分类网络。此外,与像素域方法相比,所提出的 cViT 能以更高的计算效率获得有竞争力的分类精度。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
Compressed-Domain Vision Transformer for Image Classification
Compressed-domain visual task schemes, where visual processing or computer vision are directly performed on the compressed-domain representations, were shown to achieve a higher computational efficiency during training and deployment by avoiding the need to decode the compressed visual information while resulting in a competitive or even better performance as compared to corresponding spatial-domain visual tasks. This work is concerned with learning-based compressed-domain image classification, where the image classification is performed directly on compressed-domain representations, also known as latent representations, that are obtained using a learning-based visual encoder. In this paper, a compressed-domain Vision Transformer (cViT) is proposed to perform image classification in the learning-based compressed-domain. For this purpose, the Vision Transformer (ViT) architecture is adopted and modified to perform classification directly in the compressed-domain. As part of this work, a novel feature patch embedding is introduced leveraging the within- and cross-channel information in the compressed-domain. Also, an adaptation training strategy is designed to adopt the weights from the pre-trained spatial-domain ViT and adapt these to the compressed-domain classification task. Furthermore, the pre-trained ViT weights are utilized through interpolation for position embedding initialization to further improve the performance of cViT. The experimental results show that the proposed cViT outperforms the existing compressed-domain classification networks in terms of Top-1 and Top-5 classification accuracies. Moreover, the proposed cViT can yield competitive classification accuracies with a significantly higher computational efficiency as compared to pixel-domain approaches.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
CiteScore
8.50
自引率
2.20%
发文量
86
期刊介绍: The IEEE Journal on Emerging and Selected Topics in Circuits and Systems is published quarterly and solicits, with particular emphasis on emerging areas, special issues on topics that cover the entire scope of the IEEE Circuits and Systems (CAS) Society, namely the theory, analysis, design, tools, and implementation of circuits and systems, spanning their theoretical foundations, applications, and architectures for signal and information processing.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信