{"title":"GPT4Ego: Unleashing the Potential of Pre-Trained Models for Zero-Shot Egocentric Action Recognition","authors":"Guangzhao Dai;Xiangbo Shu;Wenhao Wu;Rui Yan;Jiachao Zhang","doi":"10.1109/TMM.2024.3521658","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521658","url":null,"abstract":"Vision-Language Models (VLMs), pre-trained on large-scale datasets, have shown impressive performance in various visual recognition tasks. This advancement paves the way for notable performance in some egocentric tasks, Zero-Shot Egocentric Action Recognition (ZS-EAR), entailing VLMs zero-shot to recognize actions from first-person videos enriched in more realistic human-environment interactions. Typically, VLMs handle ZS-EAR as a global video-text matching task, which often leads to suboptimal alignment of vision and linguistic knowledge. We propose a refined approach for ZS-EAR using VLMs, emphasizing fine-grained concept-description alignment that capitalizes on the rich semantic and contextual details in egocentric videos. In this work, we introduce a straightforward yet remarkably potent VLM framework, <italic>aka</i> GPT4Ego, designed to enhance the fine-grained alignment of concept and description between vision and language. Specifically, we first propose a new Ego-oriented Text Prompting (EgoTP<inline-formula><tex-math>$spadesuit$</tex-math></inline-formula>) scheme, which effectively prompts action-related text-contextual semantics by evolving word-level class names to sentence-level contextual descriptions by ChatGPT with well-designed chain-of-thought textual prompts. Moreover, we design a new Ego-oriented Visual Parsing (EgoVP<inline-formula><tex-math>$clubsuit$</tex-math></inline-formula>) strategy that learns action-related vision-contextual semantics by refining global-level images to part-level contextual concepts with the help of SAM. Extensive experiments demonstrate GPT4Ego significantly outperforms existing VLMs on three large-scale egocentric video benchmarks, i.e., EPIC-KITCHENS-100 (33.2%<inline-formula><tex-math>$uparrow$</tex-math></inline-formula><inline-formula><tex-math>$_{bm {+9.4}}$</tex-math></inline-formula>), EGTEA (39.6%<inline-formula><tex-math>$uparrow$</tex-math></inline-formula><inline-formula><tex-math>$_{bm {+5.5}}$</tex-math></inline-formula>), and CharadesEgo (31.5%<inline-formula><tex-math>$uparrow$</tex-math></inline-formula><inline-formula><tex-math>$_{bm {+2.6}}$</tex-math></inline-formula>). In addition, benefiting from the novel mechanism of fine-grained concept and description alignment, GPT4Ego can sustainably evolve with the advancement of ever-growing pre-trained foundational models. We hope this work can encourage the egocentric community to build more investigation into pre-trained vision-language models.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"401-413"},"PeriodicalIF":8.4,"publicationDate":"2024-12-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993807","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multiview Feature Decoupling for Deep Subspace Clustering","authors":"Yuxiu Lin;Hui Liu;Ren Wang;Qiang Guo;Caiming Zhang","doi":"10.1109/TMM.2024.3521776","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521776","url":null,"abstract":"Deep multi-view subspace clustering aims to reveal a common subspace structure by exploiting rich multi-view information. Despite promising progress, current methods focus only on multi-view consistency and complementarity, often overlooking the adverse influence of entangled superfluous information in features. Moreover, most existing works lack scalability and are inefficient for large-scale scenarios. To this end, we innovatively propose a deep subspace clustering method via Multi-view Feature Decoupling (MvFD). First, MvFD incorporates well-designed multi-type auto-encoders with self-supervised learning, explicitly decoupling consistent, complementary, and superfluous features for every view. The disentangled and interpretable feature space can then better serve unified representation learning. By integrating these three types of information within a unified framework, we employ information theory to obtain a minimal and sufficient representation with high discriminability. Besides, we introduce a deep metric network to model self-expression correlation more efficiently, where network parameters remain unaffected by changes in sample numbers. Extensive experiments show that MvFD yields State-of-the-Art performance in various types of multi-view datasets.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"544-556"},"PeriodicalIF":8.4,"publicationDate":"2024-12-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143465738","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Auxiliary Representation Guided Network for Visible-Infrared Person Re-Identification","authors":"Mengzan Qi;Sixian Chan;Chen Hang;Guixu Zhang;Tieyong Zeng;Zhi Li","doi":"10.1109/TMM.2024.3521773","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521773","url":null,"abstract":"Visible-Infrared Person Re-identification aims to retrieve images of specific identities across modalities. To relieve the large cross-modality discrepancy, researchers introduce the auxiliary modality within the image space to assist modality-invariant representation learning. However, the challenge persists in constraining the inherent quality of generated auxiliary images, further leading to a bottleneck in retrieval performance. In this paper, we propose a novel Auxiliary Representation Guided Network (ARGN) to explore the potential of auxiliary representations, which are directly generated within the modality-shared embedding space. In contrast to the original visible and infrared representations, which contain information solely from their respective modalities, these auxiliary representations integrate cross-modality information by fusing both modalities. In our framework, we utilize these auxiliary representations as modality guidance to reduce the cross-modality discrepancy. First, we propose a High-quality Auxiliary Representation Learning (HARL) framework to generate identity-consistent auxiliary representations. The primary objective of our HARL is to ensure that auxiliary representations capture diverse modality information from both modalities while concurrently preserving identity-related discrimination. Second, guided by these auxiliary representations, we design an Auxiliary Representation Guided Constraint (ARGC) to optimize the modality-shared embedding space. By incorporating this constraint, the modality-shared embedding space is optimized to achieve enhanced intra-identity compactness and inter-identity separability, further improving the retrieval performance. In addition, to improve the robustness of our framework against the modality variation, we introduce a Part-based Adaptive Gaussian Module (PAGM) to adaptively extract discriminative information across modalities. Finally, extensive experiments are conducted to demonstrate the superiority of our method over state-of-the-art approaches on three VI-ReID datasets.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"340-355"},"PeriodicalIF":8.4,"publicationDate":"2024-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993832","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Focus Entirety and Perceive Environment for Arbitrary-Shaped Text Detection","authors":"Xu Han;Junyu Gao;Chuang Yang;Yuan Yuan;Qi Wang","doi":"10.1109/TMM.2024.3521797","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521797","url":null,"abstract":"Due to the diversity of scene text in aspects such as font, color, shape, and size, accurately and efficiently detecting text is still a formidable challenge. Among the various detection approaches, segmentation-based approaches have emerged as prominent contenders owing to their flexible pixel-level predictions. However, these methods typically model text instances in a bottom-up manner, which is highly susceptible to noise. In addition, the prediction of pixels is isolated without introducing pixel-feature interaction, which also influences the detection performance. To alleviate these problems, we propose a multi-information level arbitrary-shaped text detector consisting of a focus entirety module (FEM) and a perceive environment module (PEM). The former extracts instance-level features and adopts a top-down scheme to model texts to reduce the influence of noises. Specifically, it assigns consistent entirety information to pixels within the same instance to improve their cohesion. In addition, it emphasizes the scale information, enabling the model to distinguish varying scale texts effectively. The latter extracts region-level information and encourages the model to focus on the distribution of positive samples in the vicinity of a pixel, which perceives environment information. It treats the kernel pixels as positive samples and helps the model differentiate text and kernel features. Extensive experiments demonstrate the FEM's ability to efficiently support the model in handling different scale texts and confirm the PEM can assist in perceiving pixels more accurately by focusing on pixel vicinities. Comparisons show the proposed model outperforms existing state-of-the-art approaches on four public datasets.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"287-299"},"PeriodicalIF":8.4,"publicationDate":"2024-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993743","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Category-Level Multi-Object 9D State Tracking Using Object-Centric Multi-Scale Transformer in Point Cloud Stream","authors":"Jingtao Sun;Yaonan Wang;Mingtao Feng;Xiaofeng Guo;Huimin Lu;Xieyuanli Chen","doi":"10.1109/TMM.2024.3521664","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521664","url":null,"abstract":"Category-level object pose estimation and tracking has achieved impressive progress in computer vision, augmented reality, and robotics. Existing methods either estimate the object states from a single observation or only track the 6-DoF pose of a single object. In this paper, we focus on category-level multi-object 9-Dimensional (9D) state tracking from the point cloud stream. We propose a novel 9D state estimation network to estimate the 6-DoF pose and 3D size of each instance in the scene. It uses our devised multi-scale global attention and object-level local attention modules to obtain representative latent features to estimate the 9D state of each object in the current observation. We then integrate our network estimation into a Kalman filter to combine previous states with the current estimates and achieve multi-object 9D state tracking. Experiment results on two public datasets show that our method achieves state-of-the-art performance on both category-level multi-object state estimation and pose tracking tasks. Furthermore, we directly apply the pre-trained model of our method to our air-ground robot system with multiple moving objects. Experiments on our collected real-world dataset show our method's strong generalization ability and real-time pose tracking performance.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"1072-1085"},"PeriodicalIF":8.4,"publicationDate":"2024-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143465735","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Video Instance Segmentation Without Using Mask and Identity Supervision","authors":"Ge Li;Jiale Cao;Hanqing Sun;Rao Muhammad Anwer;Jin Xie;Fahad Khan;Yanwei Pang","doi":"10.1109/TMM.2024.3521668","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521668","url":null,"abstract":"Video instance segmentation (VIS) is a challenging vision problem in which the task is to simultaneously detect, segment, and track all the object instances in a video. Most existing VIS approaches rely on pixel-level mask supervision within a frame as well as instance-level identity annotation across frames. However, obtaining these ‘mask and identity’ annotations is time-consuming and expensive. We propose the first mask-identity-free VIS framework that neither utilizes mask annotations nor requires identity supervision. Accordingly, we introduce a query contrast and exchange network (QCEN) comprising instance query contrast and query-exchanged mask learning. The instance query contrast first performs cross-frame instance matching and then conducts query feature contrastive learning. The query-exchanged mask learning exploits both intra-video and inter-video query exchange properties: exchanging queries of an identical instance from different frames within a video results in consistent instance masks, whereas exchanging queries across videos results in all-zero background masks. Extensive experiments on three benchmarks (YouTube-VIS 2019, YouTube-VIS 2021, and OVIS) reveal the merits of the proposed approach, which significantly reduces the performance gap between the identify-free baseline and our mask-identify-free VIS method. On the YouTube-VIS 2019 validation set, our mask-identity-free approach achieves 91.4% of the stronger-supervision-based baseline performance when utilizing the same ImageNet pre-trained model.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"224-235"},"PeriodicalIF":8.4,"publicationDate":"2024-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993738","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-Perspective Pseudo-Label Generation and Confidence-Weighted Training for Semi-Supervised Semantic Segmentation","authors":"Kai Hu;Xiaobo Chen;Zhineng Chen;Yuan Zhang;Xieping Gao","doi":"10.1109/TMM.2024.3521801","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521801","url":null,"abstract":"Self-training has been shown to achieve remarkable gains in semi-supervised semantic segmentation by creating pseudo-labels using unlabeled data. This approach, however, suffers from the quality of the generated pseudo-labels, and generating higher quality pseudo-labels is the main challenge that needs to be addressed. In this paper, we propose a novel method for semi-supervised semantic segmentation based on Multi-perspective pseudo-label Generation and Confidence-weighted Training (MGCT). First, we present a multi-perspective pseudo-label generation strategy that considers both global and local semantic perspectives. This strategy prioritizes pixels in all images by the global and local predictions, and subsequently generates pseudo-labels for different pixels in stages according to the ranking results. Our pseudo-label generation method shows superior suitability for semi-supervised semantic segmentation compared to other approaches. Second, we propose a confidence-weighted training method to alleviate performance degradation caused by unstable pixels. Our training method assigns confident weights to unstable pixels, which reduces the interference of unstable pixels during training and facilitates the efficient training of the model. Finally, we validate our approach on the PASCAL VOC 2012 and Cityscapes datasets, and the results indicate that we achieve new state-of-the-art performance on both datasets in all settings.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"300-311"},"PeriodicalIF":8.4,"publicationDate":"2024-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993603","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Primary Code Guided Targeted Attack against Cross-modal Hashing Retrieval","authors":"Xinru Guo;Huaxiang Zhang;Li Liu;Dongmei Liu;Xu Lu;Hui Meng","doi":"10.1109/TMM.2024.3521697","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521697","url":null,"abstract":"Deep hashing algorithms have demonstrated considerable success in recent years, particularly in cross-modal retrieval tasks. Although hash-based cross-modal retrieval methods have demonstrated considerable efficacy, the vulnerability of deep networks to adversarial examples represents a significant challenge for the hash retrieval. In the absence of target semantics, previous non-targeted attack methods attempt to attack depth models by adding disturbance to the input data, yielding some positive outcomes. Nevertheless, they still lack specific instance-level hash codes and fail to consider the diversity and semantic association of different modalities, which is insufficient to meet the attacker's expectations. In response, we present a novel Primary code Guided Targeted Attack (PGTA) against cross-modal hashing retrieval. Specifically, we integrate cross-modal instances and labels to obtain well-fused target semantics, thereby enhancing cross-modal interaction. Secondly, the primary code is designed to generate discriminable information with fine-grained semantics for target labels. Benign samples and target semantics collectively generate adversarial examples under the guidance of primary codes, thereby enhancing the efficacy of targeted attacks. Extensive experiments demonstrate that our PGTA outperforms the most advanced methods on three datasets, achieving State-of-the-Art targeted attack performance.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"312-326"},"PeriodicalIF":8.4,"publicationDate":"2024-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993787","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PointAttention: Rethinking Feature Representation and Propagation in Point Cloud","authors":"Shichao Zhang;Yibo Ding;Tianxiang Huo;Shukai Duan;Lidan Wang","doi":"10.1109/TMM.2024.3521745","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521745","url":null,"abstract":"Self-attention mechanisms have revolutionized natural language processing and computer vision. However, in point cloud analysis, most existing methods focus on point convolution operators for feature extraction, but fail to model long-range and hierarchical dependencies. To overcome above issues, in this paper, we present PointAttention, a novel network for point cloud feature representation and propagation. Specifically, this architecture uses a two-stage Learnable Self-attention for long-range attention weights learning, which is more effective than conventional triple attention. Furthermore, it employs a Hierarchical Learnable Attention Mechanism to formulate momentous global prior representation and perform fine-grained context understanding, which enables our framework to break through the limitation of the receptive field and reduce the loss of contexts. Interestingly, we show that the proposed Learnable Self-attention is equivalent to the coupling of two Softmax attention operations while having lower complexity. Extensive experiments demonstrate that our network achieves highly competitive performance on several challenging publicly available benchmarks, including point cloud classification on ScanObjectNN and ModelNet40, and part segmentation on ShapeNet-Part.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"327-339"},"PeriodicalIF":8.4,"publicationDate":"2024-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993829","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Adaptive Pitfall: Exploring the Effectiveness of Adaptation in Skeleton-Based Action Recognition","authors":"Qiguang Miao;Wentian Xin;Ruyi Liu;Yi Liu;Mengyao Wu;Cheng Shi;Chi-Man Pun","doi":"10.1109/TMM.2024.3521774","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521774","url":null,"abstract":"Graph convolution networks (GCNs) have achieved remarkable performance in skeleton-based action recognition by exploiting the adjacency topology of body representation. However, the adaptive strategy adopted by the previous methods to construct the adjacency matrix is not balanced between the performance and the computational cost. We assume this concept of <italic>Adaptive Trap</i>, which can be replaced by multiple autonomous submodules, thereby simultaneously enhancing the dynamic joint representation and effectively reducing network resources. To effectuate the substitution of the adaptive model, we unveil two distinct strategies, both yielding comparable effects. (1) Optimization. <italic>Individuality and Commonality GCNs (IC-GCNs)</i> is proposed to specifically optimize the construction method of the associativity adjacency matrix for adaptive processing. The uniqueness and co-occurrence between different joint points and frames in the skeleton topology are effectively captured through methodologies like preferential fusion of physical information, extreme compression of multi-dimensional channels, and simplification of self-attention mechanism. (2) Replacement. <italic>Auto-Learning GCNs (AL-GCNs)</i> is proposed to boldly remove popular adaptive modules and cleverly utilize human key points as motion compensation to provide dynamic correlation support. AL-GCNs construct a fully learnable group adjacency matrix in both spatial and temporal dimensions, resulting in an elegant and efficient GCN-based model. In addition, three effective tricks for skeleton-based action recognition (Skip-Block, Bayesian Weight Selection Algorithm, and Simplified Dimensional Attention) are exposed and analyzed in this paper. Finally, we employ the variable channel and grouping method to explore the hardware resource bound of the two proposed models. IC-GCN and AL-GCN exhibit impressive performance across NTU-RGB+D 60, NTU-RGB+D 120, NW-UCLA, and UAV-Human datasets, with an exceptional parameter-cost ratio.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"56-71"},"PeriodicalIF":8.4,"publicationDate":"2024-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993828","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}