{"title":"Hierarchical Context Measurement Network for Single Hyperspectral Image Super-Resolution","authors":"Heng Wang;Cong Wang;Yuan Yuan","doi":"10.1109/TMM.2025.3535371","DOIUrl":"https://doi.org/10.1109/TMM.2025.3535371","url":null,"abstract":"Single hyperspectral image super-resolution aims to enhance the spatial resolution of a hyperspectral image without relying on any auxiliary information. Despite the abundant spectral information, the inherent high-dimensionality in hyperspectral images still remains a challenge for memory efficiency. Recently, recursion-based methods have been proposed to reduce memory requirements. However, these methods utilize the reconstruction features as feedback embedding to explore context information, leading to sub-optimal performance as they ignore the complementarity of different hierarchical levels of information in the context. Additionally, existing methods equivalently compensate the previous feedback information to the current band, resulting in an indistinct and untargeted introduction of the context. In this paper, we propose a hierarchical context measurement network to construct corresponding measurement strategies for different hierarchical information, capturing comprehensive and powerful complementary knowledge from the context. Specifically, a feature-wise similarity measurement module is designed to calculate global cross-layer relationships between the middle features of the current band and those of the context, so as to explore the embedded middle features discriminatively through generated global dependencies. Furthermore, considering the pixel-wise correspondence between the reconstruction features and the super-resolved results, we propose a pixel-wise similarity measurement module for the complementary reconstruction features embedding, exploring detailed complementary information within the embedded reconstruction features by dynamically generating a spatially adaptive filter for each pixel. Experimental results reported on three benchmark hyperspectral datasets reveal that the proposed method outperforms other state-of-the-art peers in both visual and metric evaluations.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"2623-2637"},"PeriodicalIF":8.4,"publicationDate":"2025-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143949271","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Analogical Augmentation and Significance Analysis for Online Task-Free Continual Learning","authors":"Songlin Dong;Yingjie Chen;Yuhang He;Yuhan Jin;Alex C. Kot;Yihong Gong","doi":"10.1109/TMM.2025.3535384","DOIUrl":"https://doi.org/10.1109/TMM.2025.3535384","url":null,"abstract":"Online task-free continual learning (OTFCL) is a more challenging variant of continual learning that emphasizes the gradual shift of task boundaries and learning in an online mode. Existing methods rely on a memory buffer of old samples to prevent forgetting. However, the use of memory buffers not only raises privacy concerns but also hinders the efficient learning of new samples. To address this problem, we propose a novel framework called I<inline-formula><tex-math>$^{2}$</tex-math></inline-formula>CANSAY that gets rid of the dependence on memory buffers and efficiently learns the knowledge of new data from one-shot samples. Concretely, our framework comprises two main modules. Firstly, the <bold>Inter-Class Analogical Augmentation</b> (ICAN) module generates diverse pseudo-features for old classes based on the inter-class analogy of feature distributions for different new classes, serving as a substitute for the memory buffer. Secondly, the <bold>Intra-Class Significance Analysis</b> (ISAY) module analyzes the significance of attributes for each class via its distribution standard deviation, and generates an importance vector as a correction bias for the linear classifier, thereby enhancing the capability of learning from new samples. We run our experiments on four popular image classification datasets: CoRe50, CIFAR-10, CIFAR-100, and CUB-200, our approach outperforms the prior state-of-the-art by a large margin.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"3370-3382"},"PeriodicalIF":8.4,"publicationDate":"2025-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144264246","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Heterogeneous Domain Adaptation via Correlative and Discriminative Feature Learning","authors":"Yuwu Lu;Dewei Lin;Linlin Shen;Yicong Zhou;Jiahui Pan","doi":"10.1109/TMM.2025.3535346","DOIUrl":"https://doi.org/10.1109/TMM.2025.3535346","url":null,"abstract":"Heterogeneous domain adaptation seeks to learn an effective classifier or regression model for unlabeled target samples by using the well-labeled source samples but residing in different feature spaces and lying different distributions. Most recent works have concentrated on learning domain-invariant feature representations to minimize the distribution divergence via target pseudo-labels. However, two critical issues need to be further explored: 1) new feature representations should be not only domain-invariant but also category-correlative and discriminative and 2) alleviating the negative transfer caused by the incorrect pseudo-labeling target samples could boost the adaptation performance during the iterative learning process. To address these issues, in this paper, we put forward a novel heterogeneous domain adaptation method to learn category-correlative and discriminative representations, referred to as correlative and discriminative feature learning (CDFL). Specifically, CDFL aims to learn a feature space where class-specific feature correlations between the source and target domains are maximized, the divergences of marginal and conditional distribution between the source and target domains are minimized, and the distances of inter-class distribution are forced to be maximized to ensure the discriminative ability. Meanwhile, a selective pseudo-labeling procedure based on the correlation coefficient and classifier prediction is introduced to boost class-specific feature correlation and discriminative distribution alignment in an iteration way. Extensive experiments certify that CDFL outperforms the State-of-the-Art algorithms on five standard benchmarks.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"3447-3461"},"PeriodicalIF":8.4,"publicationDate":"2025-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144264149","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Snippet-Inter Difference Attention Network for Weakly-Supervised Temporal Action Localization","authors":"Wei Zhou;Kang Lin;Weipeng Hu;Chao Xie;Tao Su;Haifeng Hu;Yap-Peng Tan","doi":"10.1109/TMM.2025.3535336","DOIUrl":"https://doi.org/10.1109/TMM.2025.3535336","url":null,"abstract":"The purpose of weakly-supervised temporal action localization (WTAL) task is to simultaneously classify and localize action instances in untrimmed videos with only video-level labels. Previous works fail to extract multi-scale temporal features to identify action instances with different durations, and they do not fully use the temporal cues of action video to learn discriminative features. In addition, the classifiers trained by current methods usually focus on easy-to-distinguish snippets while ignoring other semantically ambiguous features, which leads to incomplete and over-complete localization. To address these issues, we introduce a new Snippet-inter Difference Attention Network (SDANet) for WTAL, which can be trained end-to-end. Specifically, our model presents three modules, with primary contributions lying in the snippet-inter difference attention (SDA) module and potential feature mining (PFM) module. Firstly, we construct a simple multi-scale temporal feature fusion (MTFF) module to generate multi-scale temporal feature representation, so as to help the model better detect short action instances. Secondly, we consider the temporal cues of video features and design SDA module based on the Transformer to capture global discriminative features for each modality based on multi-scale features. It calculates the differences between temporal neighbor snippets in each modality to explore salient-difference features, and then utilizes them to guide correlation modeling. Thirdly, after learning discriminative features, we devise PFM module to excavate potential action and background snippets from ambiguous features. By contrastive learning, potential actions are forced closer to discriminative actions and away from the background, thereby learning more accurate action boundaries. Finally, two losses (i.e., similarity loss and reconstruction loss) are further developed to constrain the consistency between two modalities and help the model retain original feature information for better localization results. Extensive experiments show that our model achieves better performance against current WTAL methods on three datasets, i.e., THUMOS14, ActivityNet1.2 and ActivityNet1.3.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"3610-3624"},"PeriodicalIF":8.4,"publicationDate":"2025-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144264186","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"EAT: Multi-Exposure Image Fusion With Adversarial Learning and Focal Transformer","authors":"Wei Tang;Fazhi He","doi":"10.1109/TMM.2025.3535390","DOIUrl":"https://doi.org/10.1109/TMM.2025.3535390","url":null,"abstract":"In this article, different from previous traditional multi-exposure image fusion (MEF) algorithms that use hand-designed feature extraction approaches or deep learning-based algorithms that utilize convolutional neural networks for information preservation, we propose a novel multi-Exposure image fusion method via Adversarial learning and focal Transformer, named EAT. In our framework, a Focal Transformer is proposed to focus on more remarkable regions and construct long-range multi-exposure relationships, with which the fusion model can simultaneously extract local and global multi-exposure properties and therefore generate promising fusion results. To further improve the fusion performance, we introduce adversarial learning to train the proposed method in an adversarial manner with the guidance of ground truth. By doing so, the fused images exhibit better visual perception and color fidelity. Extensive experiments conducted on publicly available databases provide compelling evidence that EAT surpasses other state-of-the-art approaches on both quantitative and qualitative evaluations. Furthermore, we directly employ our trained model to address another benchmark MEF dataset. The impressive fusion performance serves as evidence of the credible generalization ability of EAT.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"3744-3754"},"PeriodicalIF":8.4,"publicationDate":"2025-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144264286","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Human-Centered Financial Signal Analysis Based on Visual Patterns in Stock Charts","authors":"Ji-Feng Luo;Yuzhen Chen;Kaixun Zhang;Xudong An;Menghan Hu;Guangtao Zhai;Xiao-Ping Zhang","doi":"10.1109/TMM.2025.3535278","DOIUrl":"https://doi.org/10.1109/TMM.2025.3535278","url":null,"abstract":"The study adopted a human-centered perspective to research the financial markets, focusing on identifying variations in eye movement patterns between professional and non-professional traders as they analyze a series of stock charts. Eye movement data was selected as the analysis target based on the hypothesis that it represents a behavioral phenotype indicative of stock analysts' cognitive processes during market analysis. Disparities were identified by conducting variance analysis and the Wilcoxon signed-rank test on statistical metrics derived from eye fixations and saccades. Psychological and behavioral economic interpretations were provided to understand the underlying reasons for these observed patterns. To showcase the practical application potential of the human-centered perspective, eye movement data and human visual characteristics were used to construct visual saliency prediction models of professional stock analysts. Leveraging this human-centered model, we developed two practical application demonstrations specifically designed to support and instruct novice traders. Based on the above demonstrations, a training program was designed that demonstrates how, with ongoing training, the non-professional traders' ability to observe stock charts improves progressively.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"4193-4205"},"PeriodicalIF":8.4,"publicationDate":"2025-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144589357","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cloud-Based Privacy-Preserving Medical Images Storage Scheme With Low Consumption","authors":"Yaolin Yang;Hongjie He;Zhuo Feng;Fan Chen;Yuan Yuan","doi":"10.1109/TMM.2025.3535335","DOIUrl":"https://doi.org/10.1109/TMM.2025.3535335","url":null,"abstract":"For the security risks and high transmission/storage consumption in cloud-based medical images storage systems (CMISS), reversible data hiding in encrypted images (RDHEI) provide an effective solution. Nevertheless, challenges persist concerning the security risks cause by key transmission and the large file size of encrypted medical images. Consequently, a cloud-based privacy-preserving medical images storage scheme with low consumption is proposed in this paper. First, RDHEI is applied to CMISS, where image encryption achieves privacy protection, reversible data hiding eliminates extra space consumption by index data self-hiding, and the reversibility enables lossless recovery and extraction of medical images and index data. Then, hybrid encryption is designed to achieve high security. The security of encrypted images is guaranteed by combining a one-time cryptosystem with symmetric XOR encryption, which makes our scheme can resist various attacks. Time-varying key used in XOR is encrypted by asymmetric RSA, and only public key is used in RSA, avoiding the risk of private key transmission. Finally, to reduce the file size of encrypted images and achieve low consumption, context Huffman coding is proposed to adaptively selects the block coding method by context and thresholds, and has at most 98 056 bits shorter than Huffman coding in encoded stream length. Experimental results show that the proposed scheme has better performance in terms on security, consumption, and reversibility. The minimum compression ratio in databases is 32.46%, which is 2.63% lower than the existing schemes. And the medical image and index data can be restored lossless.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"3556-3570"},"PeriodicalIF":8.4,"publicationDate":"2025-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144264160","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GEM: Boost Simple Network for Glass Surface Segmentation via Vision Foundation Models","authors":"Jing Hao;Moyun Liu;Jinrong Yang;Kuo Feng Hung","doi":"10.1109/TMM.2025.3535404","DOIUrl":"https://doi.org/10.1109/TMM.2025.3535404","url":null,"abstract":"Detecting glass regions is a challenging task due to the inherent ambiguity in their transparency and reflective characteristics. Current solutions in this field remain rooted in conventional deep learning paradigms, requiring the construction of annotated datasets and the design of network architectures. However, the evident drawback with these mainstream solutions lies in the time-consuming and labor-intensive process of curating datasets, alongside the increasing complexity of model structures. In this paper, we propose to address these issues by fully harnessing the capabilities of two existing vision foundation models (VFMs): Stable Diffusion and Segment Anything Model (SAM). Firstly, we construct a Synthetic but photorealistic large-scale Glass Surface Detection dataset, dubbed S-GSD, without any labour cost via Stable Diffusion. This dataset consists of four different scales, consisting of 168 k images totally with precise masks. Besides, based on the powerful segmentation ability of SAM, we devise a simple <bold>G</b>lass surface s<bold>E</b>g<bold>M</b>entor named GEM, which follows the simple query-based encoder-decoder architecture. Comprehensive experiments are conducted on the large-scale glass segmentation dataset GSD-S. Our GEM establishes a new state-of-the-art performance with the help of these two VFMs, surpassing the best-reported method GlassSemNet with an IoU improvement of 2.1%. Additionally, extensive experiments demonstrate that our synthetic dataset S-GSD exhibits remarkable performance in zero-shot and transfer learning settings.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"3501-3512"},"PeriodicalIF":8.4,"publicationDate":"2025-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144264236","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Listen With Seeing: Cross-Modal Contrastive Learning for Audio-Visual Event Localization","authors":"Chao Sun;Min Chen;Chuanbo Zhu;Sheng Zhang;Ping Lu;Jincai Chen","doi":"10.1109/TMM.2025.3535359","DOIUrl":"https://doi.org/10.1109/TMM.2025.3535359","url":null,"abstract":"In real-world physiological and psychological scenarios, there often exists a robust complementary correlation between audio and visual signals. Audio-Visual Event Localization (AVEL) aims to identify segments with Audio-Visual Events (AVEs) that contain both audio and visual tracks in unconstrained videos. Prior studies have predominantly focused on audio-visual cross-modal fusion methods, overlooking the fine-grained exploration of the cross-modal information fusion mechanism. Moreover, due to the inherent heterogeneity of multi-modal data, inevitable new noise is introduced during the audio-visual fusion process. To address these challenges, we propose a novel Cross-modal Contrastive Learning Network (CCLN) for AVEL, comprising a backbone network and a branch network. In the backbone network, drawing inspiration from physiological theories of sensory integration, we elucidate the process of audio-visual information fusion, interaction, and integration from an information-flow perspective. Notably, the Self-constrained Bi-modal Interaction (SBI) module is a bi-modal attention structure integrated with audio-visual fusion information, and through gated processing of the audio-visual correlation matrix, it effectively captures inter-modal correlation. The Foreground Event Enhancement (FEE) module emphasizes the significance of event-level boundaries by elongating the distance between scene events during training through adaptive weights. Furthermore, we introduce weak video-level labels to constrain the cross-modal semantic alignment of audio-visual events and design a weakly supervised cross-modal contrastive learning loss (WCCL Loss) function, which enhances the quality of fusion representation in the dual-branch contrastive learning framework. Extensive experiments conducted on the AVE dataset for both fully supervised and weakly supervised event localization, as well as Cross-Modal Localization (CML) tasks, demonstrate the superior performance of our model compared to state-of-the-art approaches.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"2650-2665"},"PeriodicalIF":8.4,"publicationDate":"2025-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143943983","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"One is All: A Unified Rate-Distortion-Complexity Framework for Learned Image Compression Under Energy Concentration Criteria","authors":"Chao Li;Tianyi Li;Fanyang Meng;Qingyu Mao;Youneng Bao;Yonghong Tian;Yongsheng Liang","doi":"10.1109/TMM.2025.3535279","DOIUrl":"https://doi.org/10.1109/TMM.2025.3535279","url":null,"abstract":"The learned image compression (LIC) technique has surpassed the state-of-the-art traditional codecs (H.266/VVC) in case of rate-distortion (R-D) performance. Its real-time deployments are far advanced. In order to achieve more flexible deployments, an LIC technique should be flexible in adjusting its computational complexity and rate as demanded by a situation and its environment. In this paper, we propose a unified Rate-Distortion-Complexity (R-D-C) framework for LIC under channel energy concentration criteria. Specifically, we first introduce an Energy Asymptotic Nonlinear Transformation (EANT) designed to directly concentrate on the channel energy of latent representations, thus laying the groundwork for a scalable entropy coding. Next, leveraging this energy concentration characteristic, we propose a corresponding Heterogeneous Scalable Entropy Model (HSEM) for flexibly scaling bitstreams as needed. Finally, utilizing the proposed EANT, we construct a fine-grained scalable codec for formulating, in combination with HSEM, a comprehensive scalable R-D-C framework under the energy concentration criteria. The obtained experimental results demonstrate that the proposed method could enable seamless transitions between 13 different widths of sub-models within a single network, allowing for fine-grained control over the model bitrate, complexity, and hardware inference time. Additionally, the proposed method exhibits competitive R-D performance compared to many existing methods.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"3992-4007"},"PeriodicalIF":8.4,"publicationDate":"2025-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144589411","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}