Compressed-Domain Vision Transformer for Image Classification
Authors: Ruolei Ji; Lina J. Karam
Journal: IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 14, no. 2, pp. 299-310
Publication date: 2024-04-29
Publication type: Journal Article
DOI: 10.1109/JETCAS.2024.3394878
URL: https://ieeexplore.ieee.org/document/10510316/
Impact factor: 3.7 (JCR Q2, Engineering, Electrical & Electronic)
Open access: no
Citation count: 0
Abstract
Compressed-domain visual task schemes, in which visual processing or computer vision is performed directly on compressed-domain representations, have been shown to achieve higher computational efficiency during training and deployment by avoiding the need to decode the compressed visual information, while delivering performance that is competitive with, or even better than, that of the corresponding spatial-domain visual tasks. This work addresses learning-based compressed-domain image classification, in which classification is performed directly on compressed-domain representations, also known as latent representations, obtained using a learning-based visual encoder. A compressed-domain Vision Transformer (cViT) is proposed to perform image classification in the learning-based compressed domain. To this end, the Vision Transformer (ViT) architecture is adopted and modified to perform classification directly in the compressed domain. A novel feature patch embedding is introduced that leverages the within- and cross-channel information in the compressed domain. In addition, an adaptation training strategy is designed to adopt the weights from the pre-trained spatial-domain ViT and adapt them to the compressed-domain classification task. Furthermore, the pre-trained ViT weights are used, through interpolation, to initialize the position embedding, which further improves the performance of cViT. Experimental results show that the proposed cViT outperforms existing compressed-domain classification networks in terms of Top-1 and Top-5 classification accuracy. Moreover, the proposed cViT yields competitive classification accuracy with significantly higher computational efficiency than pixel-domain approaches.
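The abstract mentions initializing the position embedding by interpolating pre-trained ViT weights but does not give the procedure. The following is a minimal sketch of one common way this could be done: bilinear resizing of the patch-position grid, with the class-token embedding copied unchanged. The function name `resize_position_embedding`, the presence of a class token in row 0, the square grids, and the align-corners sampling are all assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


def resize_position_embedding(pos_embed, old_grid, new_grid):
    """Bilinearly resize a ViT position embedding to a new token grid.

    pos_embed: array of shape (1 + old_grid * old_grid, dim); row 0 is
               assumed to be the class-token embedding (an assumption).
    Returns an array of shape (1 + new_grid * new_grid, dim).
    """
    cls_tok, grid = pos_embed[:1], pos_embed[1:]
    dim = grid.shape[1]
    grid = grid.reshape(old_grid, old_grid, dim)

    # Sample locations expressed in old-grid coordinates (align-corners).
    coords = np.linspace(0.0, old_grid - 1, new_grid)
    lo = np.floor(coords).astype(int)
    hi = np.minimum(lo + 1, old_grid - 1)
    frac = coords - lo

    # Interpolate along the row axis, then along the column axis.
    w_r = frac[:, None, None]
    rows = grid[lo] * (1 - w_r) + grid[hi] * w_r        # (new, old, dim)
    w_c = frac[None, :, None]
    out = rows[:, lo] * (1 - w_c) + rows[:, hi] * w_c   # (new, new, dim)

    # Re-attach the untouched class-token embedding.
    return np.concatenate([cls_tok, out.reshape(new_grid * new_grid, dim)], axis=0)


if __name__ == "__main__":
    # Hypothetical sizes: a 14x14 pre-trained grid resized to an 8x8 grid.
    rng = np.random.default_rng(0)
    pretrained = rng.normal(size=(1 + 14 * 14, 768))
    resized = resize_position_embedding(pretrained, old_grid=14, new_grid=8)
    print(resized.shape)
```

The grid sizes in the usage example are placeholders; the actual token grid of cViT depends on the latent resolution produced by the learned encoder, which the abstract does not specify.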
About the journal:
The IEEE Journal on Emerging and Selected Topics in Circuits and Systems is published quarterly and solicits, with particular emphasis on emerging areas, special issues on topics that cover the entire scope of the IEEE Circuits and Systems (CAS) Society, namely the theory, analysis, design, tools, and implementation of circuits and systems, spanning their theoretical foundations, applications, and architectures for signal and information processing.