{"title":"LININ: Logic Integrated Neural Inference Network for Explanatory Visual Question Answering","authors":"Dizhan Xue;Shengsheng Qian;Quan Fang;Changsheng Xu","doi":"10.1109/TMM.2024.3521709","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521709","url":null,"abstract":"Explanatory Visual Question Answering (EVQA) is a recently proposed multimodal reasoning task consisting of answering the visual question and generating multimodal explanations for the reasoning processes. Unlike traditional Visual Question Answering (VQA) task that only aims at predicting answers for visual questions, EVQA also aims to generate user-friendly explanations to improve the explainability and credibility of reasoning models. To date, existing methods for VQA and EVQA ignore the prompt in the question and enforce the model to predict the probabilities of all answers. Moreover, existing EVQA methods ignore the complex relationships among question words, visual regions, and explanation tokens. Therefore, in this work, we propose a Logic Integrated Neural Inference Network (LININ) to restrict the range of candidate answers based on first-order-logic (FOL) and capture cross-modal relationships to generate rational explanations. Firstly, we design a FOL-based question analysis program to fetch a small number of candidate answers. Secondly, we utilize a multimodal transformer encoder to extract visual and question features, and conduct the prediction on candidate answers. Finally, we design a multimodal explanation transformer to construct cross-modal relationships and generate rational explanations. Comprehensive experiments on benchmark datasets demonstrate the superiority of LININ compared with the state-of-the-art methods for EVQA.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"16-27"},"PeriodicalIF":8.4,"publicationDate":"2024-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993706","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Progressive Region-to-Boundary Exploration Network for Camouflaged Object Detection","authors":"Guanghui Yue;Shangjie Wu;Tianwei Zhou;Gang Li;Jie Du;Yu Luo;Qiuping Jiang","doi":"10.1109/TMM.2024.3521761","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521761","url":null,"abstract":"Camouflaged object detection (COD) aims to segment targeted objects that have similar colors, textures, or shapes to their background environment. Due to the limited ability in distinguishing highly similar patterns, existing COD methods usually produce inaccurate predictions, especially around the boundary areas, when coping with complex scenes. This paper proposes a Progressive Region-to-Boundary Exploration Network (PRBE-Net) to accurately detect camouflaged objects. PRBE-Net follows an encoder-decoder framework and includes three key modules. Specifically, firstly, both high-level and low-level features of the encoder are integrated by a region and boundary exploration module to explore their complementary information for extracting the object's coarse region and fine boundary cues simultaneously. Secondly, taking the region cues as the guidance information, a Region Enhancement (RE) module is used to adaptively localize and enhance the region information at each layer of the encoder. Subsequently, considering that camouflaged objects usually have blurry boundaries, a Boundary Refinement (BR) decoder is used after the RE module to better detect the boundary areas with the assistance of boundary cues. Through top-down deep supervision, PRBE-Net can progressively refine the prediction. Extensive experiments on four datasets indicate that our PRBE-Net achieves superior results over 21 state-of-the-art COD methods. Additionally, it also shows good results on polyp segmentation, a COD-related task in the medical field.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"236-248"},"PeriodicalIF":8.4,"publicationDate":"2024-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993739","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Memory-Enhanced Confidence Calibration for Class-Incremental Unsupervised Domain Adaptation","authors":"Jiaping Yu;Muli Yang;Aming Wu;Cheng Deng","doi":"10.1109/TMM.2024.3521834","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521834","url":null,"abstract":"In this paper, we focus on Class-Incremental Unsupervised Domain Adaptation (CI-UDA), where the labeled source domain already includes all classes, and the classes in the unlabeled target domain emerge sequentially over time. This task involves addressing two main challenges. The first is the domain gap between the labeled source data and the unlabeled target data, which leads to weak generalization performance. The second is the inconsistency between the source and target category spaces at each time step, which causes catastrophic forgetting during the testing stage. Previous methods focus solely on the alignment of similar samples from different domains, which overlooks the underlying causes of the domain gap/class distribution difference. To tackle the issue, we rethink this task from a causal perspective for the first time. We first build a structural causal graph to describe the CI-UDA problem. Based on the causal graph, we present Memory-Enhanced Confidence Calibration (MECC), which aims to improve confidence in the predicted results. In particular, we argue that the domain discrepancy caused by the different styles is prone to make the model produce less confident predictions and thus weakens the generalization and continual learning abilities. To this end, we first explore using the gram matrix to generate source-style target data, which is combined with the original data to jointly train the model and thereby reduce the domain-shift impact. Second, we utilize the model of the previous time step to select corresponding samples that are used to build a memory bank, which is instrumental in alleviating catastrophic forgetting. Extensive experimental results on multiple datasets demonstrate the superiority of our method.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"610-621"},"PeriodicalIF":8.4,"publicationDate":"2024-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143464353","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DCM-Net: A Diffusion Model-Based Detection Network Integrating the Characteristics of Copy-Move Forgery","authors":"Shaowei Weng;Jianhao Zhang;Tanguo Zhu;Lifang Yu;Chunyu Zhang","doi":"10.1109/TMM.2024.3521685","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521685","url":null,"abstract":"Essentially, directly introducing any object detection network to perform copy-move forgery detection (CMFD) inevitably leads to low detection accuracy. Therefore, DCM-Net, an object detection network dominated by diffusion model that incorporates the characteristics of copy-move forgery, is proposed in this paper for obviously enhancing CMFD performance. DCM-Net, as the first diffusion model-based CMFD network, has the following three improvements. Firstly, the high-similarity box padding strategy pads high-similarity boxes, rather than random boxes used in diffusion model, to ground truth boxes, better guiding subsequent dual-attention detection heads (DDHs) to focus more on high-similarity regions. Secondly, different from previous deep learning based CMFD networks that utilize self-correlation calculation to indiscriminately transform all classification features extracted from feature extraction into high-similarly features, an adaptive feature combination strategy is proposed to obtain the optimal feature transformation capable of achieving the best detection performance, enabling DDHs to more effectively distinguish source and target regions. Finally, to make detection heads have more accurate source/target localization and distinguishment, DDHs equipped with efficient multi-scale attention and contextual transformer, are proposed to generate tampered features fusing the entire precise spatial position information and rich contextual global information. The experimental results carried out on three publicly available datasets including USC-ISI, CoMoFoD, and COVERAGE, demonstrate that DCM-Net outperforms several advanced algorithms in terms of similarity detection ability and source/target differentiation ability.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"503-514"},"PeriodicalIF":8.4,"publicationDate":"2024-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143465790","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"3D Shape Segmentation With Potential Consistency Mining and Enhancement","authors":"Zhenyu Shu;Shiyang Li;Shiqing Xin;Ligang Liu","doi":"10.1109/TMM.2024.3521674","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521674","url":null,"abstract":"3D shape segmentation is a crucial task in the field of multimedia analysis and processing, and recent years have seen a surge in research on this topic. However, many existing methods only consider geometric features of 3D shapes and fail to explore the potential connections between faces, limiting their segmentation performance. In this paper, we propose a novel segmentation approach that mines and enhances the potential consistency of 3D shapes to overcome this limitation. The key idea is to mine the consistency between different partitions of 3D shapes and to use the unique consistency enhancement strategy to continuously optimize the consistency features for the network. Our method also includes a comprehensive set of network structures to mine and enhance consistent features, enabling more effective feature extraction and better utilization of contextual information around each face when processing complex shapes. We evaluate our approach on public benchmarks through extensive experiments and demonstrate its effectiveness in achieving higher accuracy than existing methods.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"133-144"},"PeriodicalIF":8.4,"publicationDate":"2024-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993521","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Meijing Zhang;Mengxue Chen;Qi Li;Yanchen Chen;Rui Lin;Xiaolian Li;Shengfeng He;Wenxi Liu
{"title":"Category-Contrastive Fine-Grained Crowd Counting and Beyond","authors":"Meijing Zhang;Mengxue Chen;Qi Li;Yanchen Chen;Rui Lin;Xiaolian Li;Shengfeng He;Wenxi Liu","doi":"10.1109/TMM.2024.3521823","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521823","url":null,"abstract":"Crowd counting has drawn increasing attention across various fields. However, existing crowd counting tasks primarily focus on estimating the overall population, ignoring the behavioral and semantic information of different social groups within the crowd. In this paper, we aim to address a newly proposed research problem, namely fine-grained crowd counting, which involves identifying different categories of individuals and accurately counting them in static images. In order to fully leverage the categorical information in static crowd images, we propose a two-tier salient feature propagation module designed to sequentially extract semantic information from both the crowd and its surrounding environment. Additionally, we introduce a category difference loss to refine the feature representation by highlighting the differences between various crowd categories. Moreover, our proposed framework can adapt to a novel problem setup called few-example fine-grained crowd counting. This setup, unlike the original fine-grained crowd counting, requires only a few exemplar point annotations instead of dense annotations from predefined categories, making it applicable in a wider range of scenarios. The baseline model for this task can be established by substituting the loss function in our proposed model with a novel hybrid loss function that integrates point-oriented cross-entropy loss and category contrastive loss. Through comprehensive experiments, we present results in both the formulation and application of fine-grained crowd counting.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"477-488"},"PeriodicalIF":8.4,"publicationDate":"2024-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993608","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Kai Ye;Zepeng Huang;Yilei Xiong;Yu Gao;Jinheng Xie;Linlin Shen
{"title":"Progressive Pseudo Labeling for Multi-Dataset Detection Over Unified Label Space","authors":"Kai Ye;Zepeng Huang;Yilei Xiong;Yu Gao;Jinheng Xie;Linlin Shen","doi":"10.1109/TMM.2024.3521841","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521841","url":null,"abstract":"Existing multi-dataset detection works mainly focus on the performance of detector on each of the datasets, with different label spaces. However, in real-world applications, a unified label space across multiple datasets is usually required. To address such a gap, we propose a progressive pseudo labeling (PPL) approach to detect objects across different datasets, over a unified label space. Specifically, we employ the widely used architecture of teacher-student model pair to jointly refine pseudo labels and train the unified object detector. The student model learns from both annotated labels and pseudo labels from the teacher model, which is updated by the exponential moving average (EMA) of the student. Three modules, i.e. Entropy-guided Adaptive Threshold (EAT), Global Classification Module (GCM) and Scene-Aware Fusion (SAF) strategy, are proposed to handle the noise of pseudo labels and fit the overall distribution. Extensive experiments are conducted on different multi-dataset benchmarks. The results demonstrate that our proposed method significantly outperforms the State-of-the-Art and is even comparable with supervised methods trained using annotations of all labels.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"531-543"},"PeriodicalIF":8.4,"publicationDate":"2024-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143465638","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hefeng Wang;Jiale Cao;Jin Xie;Aiping Yang;Yanwei Pang
{"title":"Implicit and Explicit Language Guidance for Diffusion-Based Visual Perception","authors":"Hefeng Wang;Jiale Cao;Jin Xie;Aiping Yang;Yanwei Pang","doi":"10.1109/TMM.2024.3521825","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521825","url":null,"abstract":"Text-to-image diffusion models have shown powerful ability on conditional image synthesis. With large-scale vision-language pre-training, diffusion models are able to generate high-quality images with rich textures and reasonable structures under different text prompts. However, adapting pre-trained diffusion models for visual perception is an open problem. In this paper, we propose an implicit and explicit language guidance framework for diffusion-based visual perception, named IEDP. Our IEDP comprises an implicit language guidance branch and an explicit language guidance branch. The implicit branch employs a frozen CLIP image encoder to directly generate implicit text embeddings that are fed to the diffusion model without explicit text prompts. The explicit branch uses the ground-truth labels of corresponding images as text prompts to condition feature extraction in diffusion model. During training, we jointly train the diffusion model by sharing the model weights of these two branches. As a result, the implicit and explicit branches can jointly guide feature learning. During inference, we employ only implicit branch for final prediction, which does not require any ground-truth labels. Experiments are performed on two typical perception tasks, including semantic segmentation and depth estimation. Our IEDP achieves promising performance on both tasks. For semantic segmentation, our IEDP has the mIoU<inline-formula><tex-math>$^text{ss}$</tex-math></inline-formula> score of 55.9% on ADE20K validation set, which outperforms the baseline method VPD by 2.2%. For depth estimation, our IEDP outperforms the baseline method VPD with a relative gain of 11.0%.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"466-476"},"PeriodicalIF":8.4,"publicationDate":"2024-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993519","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Context-Enriched Contrastive Loss: Enhancing Presentation of Inherent Sample Connections in Contrastive Learning Framework","authors":"Haojin Deng;Yimin Yang","doi":"10.1109/TMM.2024.3521796","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521796","url":null,"abstract":"Contrastive learning has gained popularity and pushes state-of-the-art performance across numerous large-scale benchmarks. In contrastive learning, the contrastive loss function plays a pivotal role in discerning similarities between samples through techniques such as rotation or cropping. However, this learning mechanism can also introduce information distortion from the augmented samples. This is because the trained model may develop a significant overreliance on information from samples with identical labels, while concurrently neglecting positive pairs that originate from the same initial image, especially in expansive datasets. This paper proposes a context-enriched contrastive loss function that concurrently improves learning effectiveness and addresses the information distortion by encompassing two convergence targets. The first component, which is notably sensitive to label contrast, differentiates between features of identical and distinct classes which boosts the contrastive training efficiency. Meanwhile, the second component draws closer the augmented samples from the same source image and distances all other samples, similar to self-supervised learning. We evaluate the proposed approach on image classification tasks, which are among the most widely accepted 8 recognition large-scale benchmark datasets: CIFAR10, CIFAR100, Caltech-101, Caltech-256, ImageNet, BiasedMNIST, UTKFace, and CelebA datasets. The experimental results demonstrate that the proposed method achieves improvements over 16 state-of-the-art contrastive learning methods in terms of both generalization performance and learning convergence speed. Interestingly, our technique stands out in addressing systematic distortion tasks. It demonstrates a 22.9% improvement compared to original contrastive loss functions in the downstream BiasedMNIST dataset, highlighting its promise for more efficient and equitable downstream training.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"429-441"},"PeriodicalIF":8.4,"publicationDate":"2024-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993522","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Explain Vision Focus: Blending Human Saliency Into Synthetic Face Images","authors":"Kaiwei Zhang;Dandan Zhu;Xiongkuo Min;Huiyu Duan;Guangtao Zhai","doi":"10.1109/TMM.2024.3521670","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521670","url":null,"abstract":"Synthetic faces have been extensively researched and applied in various fields, such as face parsing and recognition. Compared to real face images, synthetic faces engender more controllable and consistent experimental stimuli due to the ability to precisely merge expression animations onto the facial skeleton. Accordingly, we establish an eye-tracking database with 780 synthetic face images and fixation data collected from 22 participants. The use of synthetic images with consistent expressions ensures reliable data support for exploring the database and determining the following findings: (1) A correlation study between saliency intensity and facial movement reveals that the variation of attention distribution within facial regions is mainly attributed to the movement of the mouth. (2) A categorized analysis of different demographic factors demonstrates that the bias towards salient regions aligns with differences in some demographic categories of synthetic characters. In practice, inference of facial saliency distribution is commonly used to predict the regions of interest for facial video-related applications. Therefore, we propose a benchmark model that accurately predicts saliency maps, closely matching the ground truth annotations. This achievement is made possible by utilizing channel alignment and progressive summation for feature fusion, along with the incorporation of Sinusoidal Position Encoding. The ablation experiment also demonstrates the effectiveness of our proposed model. We hope that this paper will contribute to advancing the photorealism of generative digital humans.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"489-502"},"PeriodicalIF":8.4,"publicationDate":"2024-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993607","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}