{"title":"CLIP-GAN: Stacking CLIPs and GAN for Efficient and Controllable Text-to-Image Synthesis","authors":"Yingli Hou;Wei Zhang;Zhiliang Zhu;Hai Yu","doi":"10.1109/TMM.2025.3535304","DOIUrl":"https://doi.org/10.1109/TMM.2025.3535304","url":null,"abstract":"Recent advances in text-to-image synthesis have captivated audiences worldwide, drawing considerable attention. Although significant progress in generating photo-realistic images through large pre-trained autoregressive and diffusion models, these models face three critical constraints: (1) The requirement for extensive training data and numerous model parameters; (2) Inefficient, multi-step image generation process; and (3) Difficulties in controlling the output visual features, requiring complexly designed prompts to ensure text-image alignment. Addressing these challenges, we introduce the CLIP-GAN model, which innovatively integrates the pretrained CLIP model into both the generator and discriminator of the GAN. Our architecture includes a CLIP-based generator that employs visual concepts derived from CLIP through text prompts in a feature adapter module. We also propose a CLIP-based discriminator, utilizing CLIP's advanced scene understanding capabilities for more precise image quality evaluation. Additionally, our generator applies visual concepts from CLIP via the Text-based Generator Block (TG-Block) and the Polarized Feature Fusion Module (PFFM) enabling better fusion of text and image semantic information. This integration within the generator and discriminator enhances training efficiency, enabling our model to achieve evaluation results not inferior to large pre-trained autoregressive and diffusion models, but with a 94% reduction in learnable parameters. CLIP-GAN aims to achieve the best efficiency-accuracy trade-off in image generation given the limited resource budget. Extensive evaluations validate the superior performance of the model, demonstrating faster image generation speed and the potential for greater stylistic diversity within the GAN model, while still preserving its smooth latent space.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"3702-3715"},"PeriodicalIF":8.4,"publicationDate":"2025-01-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144264172","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Facial Action Units as a Joint Dataset Training Bridge for Facial Expression Recognition","authors":"Shuyi Mao;Xinpeng Li;Fan Zhang;Xiaojiang Peng;Yang Yang","doi":"10.1109/TMM.2025.3535327","DOIUrl":"https://doi.org/10.1109/TMM.2025.3535327","url":null,"abstract":"Label biases in facial expression recognition (FER) datasets, caused by annotators' subjectivity, pose challenges in improving the performance of target datasets when auxiliary labeled data are used. Moreover, training with multiple datasets can lead to visible degradations in the target dataset. To address these issues, we propose a novel framework called the AU-aware Vision Transformer (AU-ViT), which leverages unified action unit (AU) information and discards expression annotations of auxiliary data. AU-ViT integrates an elaborately designed AU branch in the middle part of a master ViT to enhance representation learning during training. Through qualitative and quantitative analyses, we demonstrate that AU-ViT effectively captures expression regions and is robust to real-world occlusions. Additionally, we observe that AU-ViT also yields performance improvements on the target dataset, even without auxiliary data, by utilizing pseudo AU labels. Our AU-ViT achieves performances superior to, or comparable to, that of the state-of-the-art methods on FERPlus, RAFDB, AffectNet, LSD and the other three occlusion test datasets.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"3331-3342"},"PeriodicalIF":8.4,"publicationDate":"2025-01-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144264237","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"S3GAAR: Segmented Spatiotemporal Skeleton Graph-Attention for Action Recognition","authors":"Musrea Abdo Ghaseb;Ahmed Elhayek;Fawaz Alsolami;Abdullah Marish Ali","doi":"10.1109/TMM.2025.3535284","DOIUrl":"https://doi.org/10.1109/TMM.2025.3535284","url":null,"abstract":"Human motion recognition is extremely important for many practical applications in several disciplines, such as surveillance, medicine, sports, gait analysis, and computer graphics. Graph convolutional networks (GCNs) enhance the accuracy and performance of skeleton-based action recognition. However, this approach has difficulties in modeling long-term temporal dependencies. In Addition, the fixed topology of the skeleton graph is not sufficiently robust to extract features for skeleton motions. Although transformers that rely entirely on self-attention have demonstrated great success in modeling global correlations between inputs and outputs, they ignore the local correlations between joints. In this study, we propose a novel segmented spatiotemporal skeleton graph-attention network (S3GAAR) to effectively learn different human actions and concentrate on the most operative part of the human body for each action. The proposed S3GAAR models spatial-temporal features through spatiotemporal attention for each segment to capture short-term temporal dependencies. Owing to several human actions that focus on one or more body parts such as mutual actions, our novel method divides the human skeleton into three segments: superior, inferior, and extremity joints. Our proposed method is designed to extract the features of each segment individually because human actions focus on one or more segments. Moreover, our segmented spatiotemporal graph introduces additional edges between important distant joints in the same segment. The experimental results show that our novel method outperforms state-of-the-art methods up to 1.1% on two large-scale benchmark datasets, NTU-RGB+D 60 and NTU-RGB+D 120.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"3437-3446"},"PeriodicalIF":8.4,"publicationDate":"2025-01-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144264272","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Geometry-Aware Self-Supervised Indoor 360$^{circ }$ Depth Estimation via Asymmetric Dual-Domain Collaborative Learning","authors":"Xu Wang;Ziyan He;Qiudan Zhang;You Yang;Tiesong Zhao;Jianmin Jiang","doi":"10.1109/TMM.2025.3535340","DOIUrl":"https://doi.org/10.1109/TMM.2025.3535340","url":null,"abstract":"Being able to estimate monocular depth for spherical panoramas is of fundamental importance in 3D scene perception. However, spherical distortion severely limits the effectiveness of vanilla convolutions. To push the envelope of accuracy, recent approaches attempt to utilize Tangent projection (TP) to estimate the depth of <inline-formula><tex-math>$360 ^{circ }$</tex-math></inline-formula> images. Yet, these methods still suffer from discrepancies and inconsistencies among patch-wise tangent images, as well as the lack of accurate ground truth depth maps under a supervised fashion. In this paper, we propose a geometry-aware self-supervised <inline-formula><tex-math>$360 ^{circ }$</tex-math></inline-formula> image depth estimation methodology that explores the complementary advantages of TP and Equirectangular projection (ERP) by an asymmetric dual-domain collaborative learning strategy. Especially, we first develop a lightweight asymmetric dual-domain depth estimation network, which enables to aggregate depth-related features from a single TP domain, and then produce depth distributions of the TP and ERP domains via collaborative learning. This effectively mitigates stitching artifacts and preserves fine details in depth inference without overspending model parameters. In addition, a frequent-spatial feature concentration module is devised to simultaneously capture non-local Fourier features and local spatial features, such that facilitating the efficient exploration of monocular depth cues. Moreover, we introduce a geometric structural alignment module to further improve geometric structural consistency among tangent images. Extensive experiments illustrate that our designed approach outperforms existing self-supervised <inline-formula><tex-math>$360 ^{circ }$</tex-math></inline-formula> depth estimation methods on three publicly available benchmark datasets.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"3224-3237"},"PeriodicalIF":8.4,"publicationDate":"2025-01-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144272708","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Stream-ViT: Learning Streamlined Convolutions in Vision Transformer","authors":"Yingwei Pan;Yehao Li;Ting Yao;Chong-Wah Ngo;Tao Mei","doi":"10.1109/TMM.2025.3535321","DOIUrl":"https://doi.org/10.1109/TMM.2025.3535321","url":null,"abstract":"Recently Vision Transformer (ViT) and Convolution Neural Network (CNN) start to emerge as a hybrid deep architecture with better model capacity, generalization, and latency trade-off. Most of these hybrid architectures often directly stack self-attention module with static convolution or fuse their outputs through two pathways within each block. Instead, we present a new Transformer architecture (namely Stream-ViT) to novelly integrate ViT with streamlined convolutions, i.e., a series of high-to-low resolution convolutions. The kernels of each convolution are dynamically learnt on a basis of current input features plus pre-learnt kernels throughout the whole network. The new architecture incorporates a critical pathway to streamline kernel generation that triggers the interactions between dynamically learnt convolutions across different layers. Moreover, the introduction of a layer-wise streamlined convolution is functionally equivalent to a squeezed version of multi-branch convolution structure, thereby improving the capacity of self-attention module with enlarged cardinality in a cost-efficient manner. We validate the superiority of Stream-ViT over multiple vision tasks, and its performances surpass state-of-the-art ViT and CNN backbones with comparable FLOPs.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"3755-3765"},"PeriodicalIF":8.4,"publicationDate":"2025-01-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144264163","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Heterogeneous Pairwise-Semantic Enhancement Hashing for Large-Scale Cross-Modal Retrieval","authors":"Wai Keung Wong;Lunke Fei;Jianyang Qin;Shuping Zhao;Jie Wen;Zhihao He","doi":"10.1109/TMM.2025.3535401","DOIUrl":"https://doi.org/10.1109/TMM.2025.3535401","url":null,"abstract":"Cross-modal hash learning has drawn widespread attention for large-scale multimodal retrieval because of its stability and efficiency in approximate similarity searches. However, most existing cross-modal hashing approaches employ discrete label-guided information to coarsely reflect intra- and intermodality correlations, making them less effective to measuring the semantic similarity of data with multiple modalities. In this paper, we propose a new heterogeneous pairwise-semantic enhancement hashing (HPsEH) for large-scale cross-modal retrieval by distilling higher-level pairwise-semantic similarity from supervision information. First, we adopt a supervised self-expression to learn a data-specific quantified semantic matrix, which uses real values to measure both the similarity and dissimilarity ranks of paired instances, such that the intrinsic semantics of the data can be well captured. Then, we fuse the label-based information and quantified semantic similarity to collaboratively learn the hash codes of multimodal data, such that both the intermodality consistency and modality-specific features can be simultaneously obtained during hash code learning. Moreover, we employ effective iterative optimization to address the discrete binary solution and massive pairwise matrix calculation, making the HPsEH scalable to large-scale datasets. Extensive experimental results on three widely used datasets demonstrate the superiority of our proposed HPsEH method over most state-of-the art approaches.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"3238-3250"},"PeriodicalIF":8.4,"publicationDate":"2025-01-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144272710","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Ensemble Prototype Networks for Unsupervised Cross-Modal Hashing With Cross-Task Consistency","authors":"Xiaoqing Liu;Huanqiang Zeng;Yifan Shi;Jianqing Zhu;Kaixiang Yang;Zhiwen Yu","doi":"10.1109/TMM.2025.3535378","DOIUrl":"https://doi.org/10.1109/TMM.2025.3535378","url":null,"abstract":"In the swiftly advancing realm of information retrieval, unsupervised cross-modal hashing has emerged as a focal point of research, taking advantage of the inherent advantages of the multifaceted and dynamism inherent in multimedia data. Existing unsupervised cross-modal hashing methods rely mainly on initial pre-trained correlations among cross-modal features, and the inaccurate neighborhood correlations impacts the presentation of common semantics throughout the optimization. To address the aforementioned issues, we propose <bold>E</b>nsemble <bold>P</b>rototype <bold>Net</b>works (EPNet), which delineates class attributes of cross-modal instances through an ensemble clustering methodology. EPNet seeks to extract correlation information between instances by leveraging local correlation aggregation and ensemble clustering from multiple perspectives, aiming to reduce initialization effects and enhance cross-modal representations. Specifically, the local correlation aggregation is first proposed within a batch of semantic affinity relationships to generate a precise and compact hash code among cross-modal instances. Secondly, the ensemble prototype module is employed to discern the class attributes of deep features, thereby aiding the model in extracting more universally applicable feature representations. Thirdly, an early attempt to constrict the representational congruity of local semantic affinity relationships and deep feature ensemble prototype correlations using cross-task consistency loss aims to enhance the representation of cross-modal common semantic features. Finally, EPNet outperforms several state-of-the-art cross-modal retrieval methods on three real-world image-text datasets in extensive experiments.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"3476-3488"},"PeriodicalIF":8.4,"publicationDate":"2025-01-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144264184","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Local Fine-Grained Visual Tracking","authors":"Jingjing Wu;Yifan Sun;Richang Hong","doi":"10.1109/TMM.2025.3535329","DOIUrl":"https://doi.org/10.1109/TMM.2025.3535329","url":null,"abstract":"This paper introduces a novel local fine-grained visual tracking task, aiming to precisely locate arbitrary local parts of objects. This task is motivated by our observation that in many realistic scenarios, the user demands to track a local part instead of a holistic object. However, the absence of an evaluation dataset and the distinctive characteristics of local fine-grained targets present extra challenges in conducting this research. To tackle these issues, first, this paper constructs a local fine-grained tracking (LFT) dataset to evaluate the tracking performance for local fine-grained targets. Second, this paper designs a cutting-edge solution to handle the challenges posed by properties of local objects, including ambiguity and high-proportion backgrounds. It consists of a hierarchical adaptive mask mechanism and foreground-background differentiated learning. The former adaptively searches for and masks ambiguity, which drives the network to concentrate on the local target instead of the holistic objects. The latter is constructed to distinguish foreground and background in an unsupervised manner, which is beneficial to mitigate the impacts of high-proportion backgrounds. Extensive analytic experiments are performed to verify the effectiveness of each submodule in the proposed fine-grained tracker.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"3426-3436"},"PeriodicalIF":8.4,"publicationDate":"2025-01-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144264233","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"BMB: Balanced Memory Bank for Long-Tailed Semi-Supervised Learning","authors":"Wujian Peng;Zejia Weng;Hengduo Li;Zuxuan Wu;Yu-Gang Jiang","doi":"10.1109/TMM.2025.3535115","DOIUrl":"https://doi.org/10.1109/TMM.2025.3535115","url":null,"abstract":"Exploring a substantial amount of unlabeled data, semi-supervised learning boosts the recognition performance when only a limited number of labels are provided. However, conventional methods assume a class-balanced data distribution, which is difficult to realize in practice due to the long-tailed nature of real-world data. While addressing the data imbalance is a well-explored area in supervised learning paradigms, directly transferring existing approaches to SSL is nontrivial, as prior knowledge about unlabeled data distribution remains unknown in SSL. In light of this, we introduce the Balanced Memory Bank (BMB), a framework for long-tailed semi-supervised learning. The core of BMB is an online-updated memory bank that caches historical features alongside their corresponding pseudo-labels, and the memory is also carefully maintained to ensure the data therein are class-rebalanced. Furthermore, an adaptive weighting module is incorporated to work jointly with the memory bank to further re-calibrate the biased training process. Experimental results across various datasets demonstrate the superior performance of BMB compared with state-of-the-art approaches. For instance, an improvement of 8.2% on the 1% labeled subset of ImageNet127 and 4.3% on the 50% labeled subset of ImageNet-LT.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"3677-3687"},"PeriodicalIF":8.4,"publicationDate":"2025-01-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144272709","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Subjective and Objective Quality Assessment of Non-Uniformly Distorted Omnidirectional Images","authors":"Jiebin Yan;Jiale Rao;Xuelin Liu;Yuming Fang;Yifan Zuo;Weide Liu","doi":"10.1109/TMM.2025.3535372","DOIUrl":"https://doi.org/10.1109/TMM.2025.3535372","url":null,"abstract":"Omnidirectional image quality assessment (OIQA) has been one of the hot topics in IQA with the continuous development of VR techniques, and achieved much success in the past few years. However, most studies devote themselves to the uniform distortion issue, i.e., all regions of an omnidirectional image are perturbed by the “same amount” of noise, while ignoring the non-uniform distortion issue, i.e., partial regions undergo “different amount” of perturbation with the other regions in the same omnidirectional image. Additionally, nearly all OIQA models are verified on the platforms containing a limited number of samples, which largely increases the over-fitting risk and therefore impedes the development of OIQA. To alleviate these issues, we elaborately explore this topic from both subjective and objective perspectives. Specifically, we construct a large OIQA database containing 10,320 non-uniformly distorted omnidirectional images, each of which is generated by considering quality impairments on one or two camera len(s). Then we meticulously conduct psychophysical experiments and delve into the influence of both holistic and individual factors (i.e., distortion range and viewing condition) on omnidirectional image quality. Furthermore, we propose a perception-guided OIQA model for non-uniform distortion by adaptively simulating users' viewing behavior. Experimental results demonstrate that the proposed model outperforms state-of-the-art methods.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"2695-2707"},"PeriodicalIF":8.4,"publicationDate":"2025-01-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143943984","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}