Demin Gao;Liyuan Ou;Ye Liu;Qing Yang;Honggang Wang
{"title":"DeepSpoof: Deep Reinforcement Learning-Based Spoofing Attack in Cross-Technology Multimedia Communication","authors":"Demin Gao;Liyuan Ou;Ye Liu;Qing Yang;Honggang Wang","doi":"10.1109/TMM.2024.3414660","DOIUrl":"10.1109/TMM.2024.3414660","url":null,"abstract":"Cross-technology communication is essential for the Internet of Multimedia Things (IoMT) applications, enabling seamless integration of diverse media formats, optimized data transmission, and improved user experiences across devices and platforms. This integration drives innovative and efficient IoMT solutions in areas like smart homes, smart cities, and healthcare monitoring. However, this integration of diverse wireless standards within cross-technology multimedia communication increases the susceptibility of wireless networks to attacks. Current methods lack robust authentication mechanisms, leaving them vulnerable to spoofing attacks. To mitigate this concern, we introduce DeepSpoof, a spoofing system that utilizes deep learning to analyze historical wireless traffic and anticipate future patterns in the IoMT context. This innovative approach significantly boosts an attacker's impersonation capabilities and offers a higher degree of covertness compared to traditional spoofing methods. Rigorous evaluations, leveraging both simulated and real-world data, confirm that DeepSpoof significantly elevates the average success rate of attacks.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"10879-10891"},"PeriodicalIF":8.4,"publicationDate":"2024-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141517082","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Perceptual Image Hashing Using Feature Fusion of Orthogonal Moments","authors":"Xinran Li;Zichi Wang;Guorui Feng;Xinpeng Zhang;Chuan Qin","doi":"10.1109/TMM.2024.3405660","DOIUrl":"10.1109/TMM.2024.3405660","url":null,"abstract":"Due to the limited number of stable image feature descriptors and the simplistic concatenation approach to hash generation, existing hashing methods have not achieved a satisfactory balance between robustness and discrimination. To this end, a novel perceptual hashing method is proposed in this paper using feature fusion of fractional-order continuous orthogonal moments (FrCOMs). Specifically, two robust image descriptors, i.e., fractional-order Chebyshev Fourier moments (FrCHFMs) and fractional-order radial harmonic Fourier moments (FrRHFMs), are used to extract global structural features of a color image. Then, the canonical correlation analysis (CCA) strategy is employed to fuse these features during the final hash generation process. Compared to direct concatenation, CCA excels in eliminating redundancies between feature vectors, resulting in a shorter hash sequence and higher authentication performance. A series of experiments demonstrate that the proposed method achieves satisfactory robustness, discrimination and security. Particularly, the proposed method exhibits better tampering detection ability and robustness against combined content-preserving manipulations in practical applications.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"10041-10054"},"PeriodicalIF":8.4,"publicationDate":"2024-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141517083","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Wei Jiang;Peirong Ning;Jiayu Yang;Yongqi Zhai;Feng Gao;Ronggang Wang
{"title":"LLIC: Large Receptive Field Transform Coding With Adaptive Weights for Learned Image Compression","authors":"Wei Jiang;Peirong Ning;Jiayu Yang;Yongqi Zhai;Feng Gao;Ronggang Wang","doi":"10.1109/TMM.2024.3416831","DOIUrl":"10.1109/TMM.2024.3416831","url":null,"abstract":"The effective receptive field (ERF) plays an important role in transform coding, which determines how much redundancy can be removed during transform and how many spatial priors can be utilized to synthesize textures during inverse transform. Existing methods rely on stacks of small kernels, whose ERFs remain insufficiently large, or heavy non-local attention mechanisms, which limit the potential of high-resolution image coding. To tackle this issue, we propose Large Receptive Field Transform Coding with Adaptive Weights for Learned Image Compression (LLIC). Specifically, for the \u0000<italic>first</i>\u0000 time in the learned image compression community, we introduce \u0000<italic>a few</i>\u0000 large kernel-based depth-wise convolutions to reduce more redundancy while maintaining modest complexity. Due to the wide range of image diversity, we further propose a mechanism to augment convolution adaptability through the self-conditioned generation of weights. The large kernels cooperate with non-linear embedding and gate mechanisms for better expressiveness and lighter point-wise interactions. Our investigation extends to refined training methods that unlock the full potential of these large kernels. Moreover, to promote more dynamic inter-channel interactions, we introduce an adaptive channel-wise bit allocation strategy that autonomously generates channel importance factors in a self-conditioned manner. To demonstrate the effectiveness of the proposed transform coding, we align the entropy model to compare with existing transform methods and obtain models LLIC-STF, LLIC-ELIC, and LLIC-TCM. Extensive experiments demonstrate that our proposed LLIC models have significant improvements over the corresponding baselines and reduce the BD-Rate by \u0000<inline-formula><tex-math>$9.49%, 9.47%,;text{and}; 10.94%$</tex-math></inline-formula>\u0000 on Kodak over VTM-17.0 Intra, respectively. Our LLIC models achieve state-of-the-art performances and better trade-offs between performance and complexity.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"10937-10951"},"PeriodicalIF":8.4,"publicationDate":"2024-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141968835","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Screen-Shooting Resistant Watermarking With Grayscale Deviation Simulation","authors":"Yiyi Li;Xin Liao;Xiaoshuai Wu","doi":"10.1109/TMM.2024.3415415","DOIUrl":"10.1109/TMM.2024.3415415","url":null,"abstract":"With the prevalence of electronic devices in our daily lives, content leakages frequently occur, and to enable leakage tracing, screen-shooting resistant watermarking has attracted tremendous attention. However, current studies often overlook a thoughtful investigation of the cross-media screen-camera process and fail to consider the effect of grayscale deviation on the screen. In this paper, we propose \u0000<underline>s</u>\u0000creen-\u0000<underline>s</u>\u0000hooting \u0000<underline>d</u>\u0000istortion \u0000<underline>s</u>\u0000imulation (\u0000<inline-formula><tex-math>$bf {SSDS}$</tex-math></inline-formula>\u0000), which involves a grayscale deviation function for constructing a more practical noise layer. We divide SSDS into screen displaying and camera shooting. For screen displaying, different viewing angles result in grayscale deviation with distinct intensities, and we simulate the distortions by modeling the relative position of the viewing point and the screen plane. For camera shooting, a series of distortion functions are used to approximate the perturbations in the camera pipeline, including defocus blur, noise and JPEG compression. Furthermore, the gradient-guided encoder is designed to conduct the embedding in the texture region using a modification cost map. Experimental results show that our proposed watermarking framework outperforms the state-of-the-art methods in terms of robustness and visual quality.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"10908-10923"},"PeriodicalIF":8.4,"publicationDate":"2024-06-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141968837","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Deeply Hybrid Contrastive Learning Based on Semantic Pseudo-Label for Salient Object Detection in Optical Remote Sensing Images","authors":"Yu Qiu;Yuhang Sun;Jie Mei;Jing Xu","doi":"10.1109/TMM.2024.3414669","DOIUrl":"10.1109/TMM.2024.3414669","url":null,"abstract":"Salient object detection in natural scene images (NSI-SOD) has undergone remarkable advancements in recent years. However, compared to those of natural images, the properties of remote sensing images (ORSIs), such as diverse spatial resolutions, complex background structures, and varying visual attributes of objects, are more complicated. Hence, how to explore the multiscale structural perceptual information of ORSIs to accurately detect salient objects is more challenging. In this paper, inspired by the superiority of contrastive learning, we propose a novel training paradigm for ORSI-SOD, named Deeply Hybrid Contrastive Learning Based on Semantic Pseudo-Label (DHCont), to force the network to extract rich structural perceptual information and further learn the better-structured feature embedding spaces. Specifically, DHCont first splits the ORSI into several local subregions composed of color- and texture-similar pixels, which act as semantic pseudo-labels. This strategy can effectively explore the underdeveloped semantic categories in ORSI-SOD. To delve deeper into multiscale structure-aware optimization, DHCont incorporates a hybrid contrast strategy that integrates “pixel-to-pixel”, “region-to-region”, “pixel-to-region”, and “region-to-pixel” contrasts at multiple scales. Additionally, to enhance the edge details of salient regions, we develop a hard edge contrast strategy that focuses on improving the detection accuracy of hard pixels near the object boundary. Moreover, we introduce a deep contrast algorithm that adds additional deep-level constraints to the feature spaces of multiple stages. Extensive experiments on two popular ORSI-SOD datasets demonstrate that simply integrating our DHCont into the existing ORSI-SOD models can significantly improve the performance.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"10892-10907"},"PeriodicalIF":8.4,"publicationDate":"2024-06-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141968838","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Low-Light Image Enhancement With SAM-Based Structure Priors and Guidance","authors":"Guanlin Li;Bin Zhao;Xuelong Li","doi":"10.1109/TMM.2024.3414328","DOIUrl":"10.1109/TMM.2024.3414328","url":null,"abstract":"Low-light images often suffer from severe detail lost in darker areas and non-uniform illumination distribution across distinct regions. Thus, structure modeling and region-specific illumination manipulation are crucial for high-quality enhanced image generation. However, previous methods encounter limitations in exploring robust structure priors and lack adequate modeling of illumination relationships among different regions, resulting in structure artifacts and color deviations. To alleviate this limitation, we propose a Segmentation-Guided Framework (SGF) which integrates the constructed robust segmentation priors to guide the enhancement process. Specifically, SGF first constructs a robust image-level edge prior based on the segmentation results of the Segment Anything Model (SAM) in a zero-shot manner. Then, we generate lighted-up region-aware feature-level prior by incorporating region-aware dynamic convolution. To adequately model long-distance illumination interactions across distinct regions, we design a segmentation-guided transformer block (SGTB), which utilizes the lighted-up region-aware feature-level prior to guide self-attention calculation. By arranging the SGTBs in a symmetric hierarchical structure, we derive a segmentation-guided enhancement module that operates under the guidance of both the image and feature-level priors. Comprehensive experimental results show that our SGF performs remarkably in both quantitative evaluation and visual comparison.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"10854-10866"},"PeriodicalIF":8.4,"publicationDate":"2024-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141968845","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yuxuan Luo;Runmin Cong;Xialei Liu;Horace Ho Shing Ip;Sam Kwong
{"title":"Modeling Inner- and Cross-Task Contrastive Relations for Continual Image Classification","authors":"Yuxuan Luo;Runmin Cong;Xialei Liu;Horace Ho Shing Ip;Sam Kwong","doi":"10.1109/TMM.2024.3414277","DOIUrl":"10.1109/TMM.2024.3414277","url":null,"abstract":"Existing continual image classification methods demonstrate that samples from all sequences of continual classification tasks contain common (task-invariant) features and class-specific (task-variant) features that can be decoupled for classification tasks. However, the existing feature decomposition strategies only focus on individual tasks while neglecting the essential cues that the relationship between different tasks can provide, thereby hindering the improvement of continual image classification results. To address this issue, we propose an Adversarial Contrastive Continual Learning (ACCL) method that decouples task-invariant and task-variant features by constructing all-round, multi-level contrasts on sample pairs within individual tasks or from different tasks. Specifically, three constraints on the distribution of task-invariant and task-variant features are included, i.e., task-invariant features across different tasks should remain consistent, task-variant features should exhibit differences, and task-invariant and task-variant features should differ from each other. At the same time, we also design an effective contrastive replay strategy to make full use of the replay samples to participate in the construction of sample pairs, further alleviating the forgetting problem, and modeling cross-task relationships. Through extensive experiments on continual image classification tasks on CIFAR100, MiniImageNet and TinyImageNet, we show the superiority of our proposed strategy, improving the accuracy and with better visualized outcomes.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"10842-10853"},"PeriodicalIF":8.4,"publicationDate":"2024-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141968839","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Bilateral Interaction for Local-Global Collaborative Perception in Low-Light Image Enhancement","authors":"Rui Xu;Yuezhou Li;Yuzhen Niu;Huangbiao Xu;Yuzhong Chen;Tiesong Zhao","doi":"10.1109/TMM.2024.3413293","DOIUrl":"10.1109/TMM.2024.3413293","url":null,"abstract":"Low-light image enhancement is a challenging task due to the limited visibility in dark environments. While recent advances have shown progress in integrating CNNs and Transformers, the inadequate local-global perceptual interactions still impedes their application in complex degradation scenarios. To tackle this issue, we propose BiFormer, a lightweight framework that facilitates local-global collaborative perception via bilateral interaction. Specifically, our framework introduces a core CNN-Transformer collaborative perception block (CPB) that combines local-aware convolutional attention (LCA) and global-aware recursive Transformer (GRT) to simultaneously preserve local details and ensure global consistency. To promote perceptual interaction, we adopt bilateral interaction strategy for both local and global perception, which involves local-to-global second-order interaction (SoI) in the dual-domain, as well as a mixed-channel fusion (MCF) module for global-to-local interaction. The MCF is also a highly efficient feature fusion module tailored for degraded features. Extensive experiments conducted on low-level and high-level tasks demonstrate that BiFormer achieves state-of-the-art performance. Furthermore, it exhibits a significant reduction in model parameters and computational cost compared to existing Transformer-based low-light image enhancement methods.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"10792-10804"},"PeriodicalIF":8.4,"publicationDate":"2024-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141968840","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Liangchen Liu;Nannan Wang;Decheng Liu;Xi Yang;Xinbo Gao;Tongliang Liu
{"title":"Towards Specific Domain Prompt Learning via Improved Text Label Optimization","authors":"Liangchen Liu;Nannan Wang;Decheng Liu;Xi Yang;Xinbo Gao;Tongliang Liu","doi":"10.1109/TMM.2024.3413318","DOIUrl":"10.1109/TMM.2024.3413318","url":null,"abstract":"Prompt learning has emerged as a thriving parameter-efficient fine-tuning technique for adapting pre-trained vision-language models (VLMs) to various downstream tasks. However, existing prompt learning approaches still exhibit limited capability for adapting foundational VLMs to specific domains that require specialized and expert-level knowledge. Since this kind of specific knowledge is primarily embedded in the pre-defined text labels, we infer that foundational VLMs cannot directly interpret semantic meaningful information from these specific text labels, which causes the above limitation. From this perspective, this paper additionally models text labels with learnable tokens and casts this operation into traditional prompt learning framework. By optimizing label tokens, semantic meaningful text labels are automatically learned for each class. Nevertheless, directly optimizing text label still remains two critical problems, i.e., insufficient optimization and biased optimization. We further address these problems by proposing Modality Interaction Text Label Optimization (MITLOp) and Color-based Consistency Augmentation (CCAug) respectively, thereby effectively improving the quality of the optimized text labels. Extensive experiments indicate that our proposed method achieves significant improvements in VLM adaptation on specific domains.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"10805-10815"},"PeriodicalIF":8.4,"publicationDate":"2024-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141968841","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ADMNet: Attention-Guided Densely Multi-Scale Network for Lightweight Salient Object Detection","authors":"Xiaofei Zhou;Kunye Shen;Zhi Liu","doi":"10.1109/TMM.2024.3413529","DOIUrl":"10.1109/TMM.2024.3413529","url":null,"abstract":"Recently, benefitting from the rapid development of deep learning technology, the research of salient object detection has achieved great progress. However, the performance of existing cutting-edge saliency models relies on large network size and high computational overhead. This is unamiable to real-world applications, especially the practical platforms with low cost and limited computing resources. In this paper, we propose a novel lightweight saliency model, namely Attention-guided Densely Multi-scale Network (ADMNet), to tackle this issue. Firstly, we design the multi-scale perception (MP) module to acquire different contextual features by using different receptive fields. Embarking on MP module, we build the encoder of our model, where each convolutional block adopts a dense structure to connect MP modules. Following this way, our model can provide powerful encoder features for the characterization of salient objects. Secondly, we employ dual attention (DA) module to equip the decoder blocks. Particularly, in DA module, the binarized coarse saliency inference of the decoder block (\u0000<italic>i.e.</i>\u0000, a hard spatial attention map) is first employed to filter out interference cues from the decoder feature, and then by introducing large receptive fields, the enhanced decoder feature is used to generate a soft spatial attention map, which further purifies the fused features. Following this way, the deep features are steered to give more concerns to salient regions. Extensive experiments on five public challenging datasets including ECSSD, DUT-OMRON, DUTS-TE, HKU-IS, and PASCAL-S clearly show that our model achieves comparable performance with the state-of-the-art saliency models while running at a 219.4fps GPU speed and a 1.76fps CPU speed for a 368×368 image with only 0.84 M parameters.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"10828-10841"},"PeriodicalIF":8.4,"publicationDate":"2024-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141968842","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}