{"title":"Toward Efficient Video Compression Artifact Detection and Removal: A Benchmark Dataset","authors":"Liqun Lin;Mingxing Wang;Jing Yang;Keke Zhang;Tiesong Zhao","doi":"10.1109/TMM.2024.3414549","DOIUrl":"10.1109/TMM.2024.3414549","url":null,"abstract":"Video compression leads to compression artifacts, among which Perceivable Encoding Artifacts (PEAs) degrade user perception. Most of existing state-of-the-art Video Compression Artifact Removal (VCAR) methods indiscriminately process all artifacts, thus leading to over-enhancement in non-PEA regions. Therefore, accurate detection and location of PEAs is crucial. In this paper, we propose the largest-ever Fine-grained PEA database (FPEA). First, we employ the popular video codecs, VVC and AVS3, as well as their common test settings, to generate four types of spatial PEAs (blurring, blocking, ringing and color bleeding) and two types of temporal PEAs (flickering and floating). Second, we design a labeling platform and recruit sufficient subjects to manually locate all the above types of PEAs. Third, we propose a voting mechanism and feature matching to synthesize all subjective labels to obtain the final PEA labels with fine-grained locations. Besides, we also provide Mean Opinion Score (MOS) values of all compressed video sequences. Experimental results show the effectiveness of FPEA database on both VCAR and compressed Video Quality Assessment (VQA). We envision that FPEA database will benefit the future development of VCAR, VQA and perception-aware video encoders. The FPEA database has been made publicly available.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"10816-10827"},"PeriodicalIF":8.4,"publicationDate":"2024-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141549758","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Human-Centric Behavior Description in Videos: New Benchmark and Model","authors":"Lingru Zhou;Yiqi Gao;Manqing Zhang;Peng Wu;Peng Wang;Yanning Zhang","doi":"10.1109/TMM.2024.3414263","DOIUrl":"10.1109/TMM.2024.3414263","url":null,"abstract":"In the domain of video surveillance, describing the behavior of each individual within the video is becoming increasingly essential, especially in complex scenarios with multiple individuals present. This is because describing each individual's behavior provides more detailed situational analysis, enabling accurate assessment and response to potential risks, ensuring the safety and harmony of public places. Currently, video-level captioning datasets cannot provide fine-grained descriptions for each individual's specific behavior. However, mere descriptions at the video-level fail to provide an in-depth interpretation of individual behaviors, making it challenging to accurately determine the specific identity of each individual. To address this challenge, we construct a human-centric video surveillance captioning dataset, which provides detailed descriptions of the dynamic behaviors of 7,820 individuals. Specifically, we have labeled several aspects of each person, such as location, clothing, and interactions with other elements in the scene, and these people are distributed across 1,012 videos. Based on this dataset, we can link individuals to their respective behaviors, allowing for further analysis of each person's behavior in surveillance videos. Besides the dataset, we propose a novel video captioning approach that can describe individual behavior in detail on a person-level basis, achieving state-of-the-art results.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"10867-10878"},"PeriodicalIF":8.4,"publicationDate":"2024-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141531141","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DeepSpoof: Deep Reinforcement Learning-Based Spoofing Attack in Cross-Technology Multimedia Communication","authors":"Demin Gao;Liyuan Ou;Ye Liu;Qing Yang;Honggang Wang","doi":"10.1109/TMM.2024.3414660","DOIUrl":"10.1109/TMM.2024.3414660","url":null,"abstract":"Cross-technology communication is essential for the Internet of Multimedia Things (IoMT) applications, enabling seamless integration of diverse media formats, optimized data transmission, and improved user experiences across devices and platforms. This integration drives innovative and efficient IoMT solutions in areas like smart homes, smart cities, and healthcare monitoring. However, this integration of diverse wireless standards within cross-technology multimedia communication increases the susceptibility of wireless networks to attacks. Current methods lack robust authentication mechanisms, leaving them vulnerable to spoofing attacks. To mitigate this concern, we introduce DeepSpoof, a spoofing system that utilizes deep learning to analyze historical wireless traffic and anticipate future patterns in the IoMT context. This innovative approach significantly boosts an attacker's impersonation capabilities and offers a higher degree of covertness compared to traditional spoofing methods. Rigorous evaluations, leveraging both simulated and real-world data, confirm that DeepSpoof significantly elevates the average success rate of attacks.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"10879-10891"},"PeriodicalIF":8.4,"publicationDate":"2024-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141517082","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Screen-Shooting Resistant Watermarking With Grayscale Deviation Simulation","authors":"Yiyi Li;Xin Liao;Xiaoshuai Wu","doi":"10.1109/TMM.2024.3415415","DOIUrl":"10.1109/TMM.2024.3415415","url":null,"abstract":"With the prevalence of electronic devices in our daily lives, content leakages frequently occur, and to enable leakage tracing, screen-shooting resistant watermarking has attracted tremendous attention. However, current studies often overlook a thoughtful investigation of the cross-media screen-camera process and fail to consider the effect of grayscale deviation on the screen. In this paper, we propose \u0000<underline>s</u>\u0000creen-\u0000<underline>s</u>\u0000hooting \u0000<underline>d</u>\u0000istortion \u0000<underline>s</u>\u0000imulation (\u0000<inline-formula><tex-math>$bf {SSDS}$</tex-math></inline-formula>\u0000), which involves a grayscale deviation function for constructing a more practical noise layer. We divide SSDS into screen displaying and camera shooting. For screen displaying, different viewing angles result in grayscale deviation with distinct intensities, and we simulate the distortions by modeling the relative position of the viewing point and the screen plane. For camera shooting, a series of distortion functions are used to approximate the perturbations in the camera pipeline, including defocus blur, noise and JPEG compression. Furthermore, the gradient-guided encoder is designed to conduct the embedding in the texture region using a modification cost map. Experimental results show that our proposed watermarking framework outperforms the state-of-the-art methods in terms of robustness and visual quality.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"10908-10923"},"PeriodicalIF":8.4,"publicationDate":"2024-06-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141968837","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Deeply Hybrid Contrastive Learning Based on Semantic Pseudo-Label for Salient Object Detection in Optical Remote Sensing Images","authors":"Yu Qiu;Yuhang Sun;Jie Mei;Jing Xu","doi":"10.1109/TMM.2024.3414669","DOIUrl":"10.1109/TMM.2024.3414669","url":null,"abstract":"Salient object detection in natural scene images (NSI-SOD) has undergone remarkable advancements in recent years. However, compared to those of natural images, the properties of remote sensing images (ORSIs), such as diverse spatial resolutions, complex background structures, and varying visual attributes of objects, are more complicated. Hence, how to explore the multiscale structural perceptual information of ORSIs to accurately detect salient objects is more challenging. In this paper, inspired by the superiority of contrastive learning, we propose a novel training paradigm for ORSI-SOD, named Deeply Hybrid Contrastive Learning Based on Semantic Pseudo-Label (DHCont), to force the network to extract rich structural perceptual information and further learn the better-structured feature embedding spaces. Specifically, DHCont first splits the ORSI into several local subregions composed of color- and texture-similar pixels, which act as semantic pseudo-labels. This strategy can effectively explore the underdeveloped semantic categories in ORSI-SOD. To delve deeper into multiscale structure-aware optimization, DHCont incorporates a hybrid contrast strategy that integrates “pixel-to-pixel”, “region-to-region”, “pixel-to-region”, and “region-to-pixel” contrasts at multiple scales. Additionally, to enhance the edge details of salient regions, we develop a hard edge contrast strategy that focuses on improving the detection accuracy of hard pixels near the object boundary. Moreover, we introduce a deep contrast algorithm that adds additional deep-level constraints to the feature spaces of multiple stages. Extensive experiments on two popular ORSI-SOD datasets demonstrate that simply integrating our DHCont into the existing ORSI-SOD models can significantly improve the performance.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"10892-10907"},"PeriodicalIF":8.4,"publicationDate":"2024-06-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141968838","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Low-Light Image Enhancement With SAM-Based Structure Priors and Guidance","authors":"Guanlin Li;Bin Zhao;Xuelong Li","doi":"10.1109/TMM.2024.3414328","DOIUrl":"10.1109/TMM.2024.3414328","url":null,"abstract":"Low-light images often suffer from severe detail lost in darker areas and non-uniform illumination distribution across distinct regions. Thus, structure modeling and region-specific illumination manipulation are crucial for high-quality enhanced image generation. However, previous methods encounter limitations in exploring robust structure priors and lack adequate modeling of illumination relationships among different regions, resulting in structure artifacts and color deviations. To alleviate this limitation, we propose a Segmentation-Guided Framework (SGF) which integrates the constructed robust segmentation priors to guide the enhancement process. Specifically, SGF first constructs a robust image-level edge prior based on the segmentation results of the Segment Anything Model (SAM) in a zero-shot manner. Then, we generate lighted-up region-aware feature-level prior by incorporating region-aware dynamic convolution. To adequately model long-distance illumination interactions across distinct regions, we design a segmentation-guided transformer block (SGTB), which utilizes the lighted-up region-aware feature-level prior to guide self-attention calculation. By arranging the SGTBs in a symmetric hierarchical structure, we derive a segmentation-guided enhancement module that operates under the guidance of both the image and feature-level priors. Comprehensive experimental results show that our SGF performs remarkably in both quantitative evaluation and visual comparison.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"10854-10866"},"PeriodicalIF":8.4,"publicationDate":"2024-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141968845","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Modeling Inner- and Cross-Task Contrastive Relations for Continual Image Classification","authors":"Yuxuan Luo;Runmin Cong;Xialei Liu;Horace Ho Shing Ip;Sam Kwong","doi":"10.1109/TMM.2024.3414277","DOIUrl":"10.1109/TMM.2024.3414277","url":null,"abstract":"Existing continual image classification methods demonstrate that samples from all sequences of continual classification tasks contain common (task-invariant) features and class-specific (task-variant) features that can be decoupled for classification tasks. However, the existing feature decomposition strategies only focus on individual tasks while neglecting the essential cues that the relationship between different tasks can provide, thereby hindering the improvement of continual image classification results. To address this issue, we propose an Adversarial Contrastive Continual Learning (ACCL) method that decouples task-invariant and task-variant features by constructing all-round, multi-level contrasts on sample pairs within individual tasks or from different tasks. Specifically, three constraints on the distribution of task-invariant and task-variant features are included, i.e., task-invariant features across different tasks should remain consistent, task-variant features should exhibit differences, and task-invariant and task-variant features should differ from each other. At the same time, we also design an effective contrastive replay strategy to make full use of the replay samples to participate in the construction of sample pairs, further alleviating the forgetting problem, and modeling cross-task relationships. Through extensive experiments on continual image classification tasks on CIFAR100, MiniImageNet and TinyImageNet, we show the superiority of our proposed strategy, improving the accuracy and with better visualized outcomes.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"10842-10853"},"PeriodicalIF":8.4,"publicationDate":"2024-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141968839","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Bilateral Interaction for Local-Global Collaborative Perception in Low-Light Image Enhancement","authors":"Rui Xu;Yuezhou Li;Yuzhen Niu;Huangbiao Xu;Yuzhong Chen;Tiesong Zhao","doi":"10.1109/TMM.2024.3413293","DOIUrl":"10.1109/TMM.2024.3413293","url":null,"abstract":"Low-light image enhancement is a challenging task due to the limited visibility in dark environments. While recent advances have shown progress in integrating CNNs and Transformers, the inadequate local-global perceptual interactions still impedes their application in complex degradation scenarios. To tackle this issue, we propose BiFormer, a lightweight framework that facilitates local-global collaborative perception via bilateral interaction. Specifically, our framework introduces a core CNN-Transformer collaborative perception block (CPB) that combines local-aware convolutional attention (LCA) and global-aware recursive Transformer (GRT) to simultaneously preserve local details and ensure global consistency. To promote perceptual interaction, we adopt bilateral interaction strategy for both local and global perception, which involves local-to-global second-order interaction (SoI) in the dual-domain, as well as a mixed-channel fusion (MCF) module for global-to-local interaction. The MCF is also a highly efficient feature fusion module tailored for degraded features. Extensive experiments conducted on low-level and high-level tasks demonstrate that BiFormer achieves state-of-the-art performance. Furthermore, it exhibits a significant reduction in model parameters and computational cost compared to existing Transformer-based low-light image enhancement methods.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"10792-10804"},"PeriodicalIF":8.4,"publicationDate":"2024-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141968840","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards Specific Domain Prompt Learning via Improved Text Label Optimization","authors":"Liangchen Liu;Nannan Wang;Decheng Liu;Xi Yang;Xinbo Gao;Tongliang Liu","doi":"10.1109/TMM.2024.3413318","DOIUrl":"10.1109/TMM.2024.3413318","url":null,"abstract":"Prompt learning has emerged as a thriving parameter-efficient fine-tuning technique for adapting pre-trained vision-language models (VLMs) to various downstream tasks. However, existing prompt learning approaches still exhibit limited capability for adapting foundational VLMs to specific domains that require specialized and expert-level knowledge. Since this kind of specific knowledge is primarily embedded in the pre-defined text labels, we infer that foundational VLMs cannot directly interpret semantic meaningful information from these specific text labels, which causes the above limitation. From this perspective, this paper additionally models text labels with learnable tokens and casts this operation into traditional prompt learning framework. By optimizing label tokens, semantic meaningful text labels are automatically learned for each class. Nevertheless, directly optimizing text label still remains two critical problems, i.e., insufficient optimization and biased optimization. We further address these problems by proposing Modality Interaction Text Label Optimization (MITLOp) and Color-based Consistency Augmentation (CCAug) respectively, thereby effectively improving the quality of the optimized text labels. Extensive experiments indicate that our proposed method achieves significant improvements in VLM adaptation on specific domains.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"10805-10815"},"PeriodicalIF":8.4,"publicationDate":"2024-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141968841","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ADMNet: Attention-Guided Densely Multi-Scale Network for Lightweight Salient Object Detection","authors":"Xiaofei Zhou;Kunye Shen;Zhi Liu","doi":"10.1109/TMM.2024.3413529","DOIUrl":"10.1109/TMM.2024.3413529","url":null,"abstract":"Recently, benefitting from the rapid development of deep learning technology, the research of salient object detection has achieved great progress. However, the performance of existing cutting-edge saliency models relies on large network size and high computational overhead. This is unamiable to real-world applications, especially the practical platforms with low cost and limited computing resources. In this paper, we propose a novel lightweight saliency model, namely Attention-guided Densely Multi-scale Network (ADMNet), to tackle this issue. Firstly, we design the multi-scale perception (MP) module to acquire different contextual features by using different receptive fields. Embarking on MP module, we build the encoder of our model, where each convolutional block adopts a dense structure to connect MP modules. Following this way, our model can provide powerful encoder features for the characterization of salient objects. Secondly, we employ dual attention (DA) module to equip the decoder blocks. Particularly, in DA module, the binarized coarse saliency inference of the decoder block (\u0000<italic>i.e.</i>\u0000, a hard spatial attention map) is first employed to filter out interference cues from the decoder feature, and then by introducing large receptive fields, the enhanced decoder feature is used to generate a soft spatial attention map, which further purifies the fused features. Following this way, the deep features are steered to give more concerns to salient regions. Extensive experiments on five public challenging datasets including ECSSD, DUT-OMRON, DUTS-TE, HKU-IS, and PASCAL-S clearly show that our model achieves comparable performance with the state-of-the-art saliency models while running at a 219.4fps GPU speed and a 1.76fps CPU speed for a 368×368 image with only 0.84 M parameters.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"10828-10841"},"PeriodicalIF":8.4,"publicationDate":"2024-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141968842","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}