{"title":"SQL-Net: Semantic Query Learning for Point-Supervised Temporal Action Localization","authors":"Yu Wang;Shengjie Zhao;Shiwei Chen","doi":"10.1109/TMM.2024.3521799","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521799","url":null,"abstract":"Point-supervised Temporal Action Localization (PS-TAL) detects temporal intervals of actions in untrimmed videos with a label-efficient paradigm. However, most existing methods fail to learn action completeness without instance-level annotations, resulting in fragmentary region predictions. In fact, the semantic information of snippets is crucial for detecting complete actions: snippets with similar representations should be assigned to the same action category. To address this issue, we propose a novel representation refinement framework with a semantic query mechanism to enhance the discriminability of snippet-level features. Concretely, we set a group of learnable queries, each representing a specific action category, and dynamically update them based on the video context. With the assistance of these queries, we search for the optimal action sequence that agrees with their semantics. In addition, we leverage some reliable proposals as pseudo labels and design a refinement and completeness module to further refine temporal boundaries, so that the completeness of action instances is captured. Finally, we demonstrate the superiority of the proposed method over existing state-of-the-art approaches on the THUMOS14 and ActivityNet13 benchmarks. 
Notably, thanks to completeness learning, our algorithm achieves significant improvements under more stringent evaluation metrics.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"84-94"},"PeriodicalIF":8.4,"publicationDate":"2024-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993825","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"LININ: Logic Integrated Neural Inference Network for Explanatory Visual Question Answering","authors":"Dizhan Xue;Shengsheng Qian;Quan Fang;Changsheng Xu","doi":"10.1109/TMM.2024.3521709","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521709","url":null,"abstract":"Explanatory Visual Question Answering (EVQA) is a recently proposed multimodal reasoning task that consists of answering a visual question and generating multimodal explanations for the reasoning process. Unlike the traditional Visual Question Answering (VQA) task, which only aims at predicting answers to visual questions, EVQA also aims to generate user-friendly explanations that improve the explainability and credibility of reasoning models. To date, existing methods for VQA and EVQA ignore the prompt in the question and force the model to predict the probabilities of all answers. Moreover, existing EVQA methods ignore the complex relationships among question words, visual regions, and explanation tokens. Therefore, in this work, we propose a Logic Integrated Neural Inference Network (LININ) to restrict the range of candidate answers based on first-order logic (FOL) and capture cross-modal relationships to generate rational explanations. Firstly, we design a FOL-based question analysis program to fetch a small number of candidate answers. Secondly, we utilize a multimodal transformer encoder to extract visual and question features, and conduct the prediction over the candidate answers. Finally, we design a multimodal explanation transformer to construct cross-modal relationships and generate rational explanations. 
Comprehensive experiments on benchmark datasets demonstrate the superiority of LININ compared with the state-of-the-art methods for EVQA.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"16-27"},"PeriodicalIF":8.4,"publicationDate":"2024-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993706","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Progressive Region-to-Boundary Exploration Network for Camouflaged Object Detection","authors":"Guanghui Yue;Shangjie Wu;Tianwei Zhou;Gang Li;Jie Du;Yu Luo;Qiuping Jiang","doi":"10.1109/TMM.2024.3521761","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521761","url":null,"abstract":"Camouflaged object detection (COD) aims to segment targeted objects that have similar colors, textures, or shapes to their background environment. Due to their limited ability to distinguish highly similar patterns, existing COD methods usually produce inaccurate predictions, especially around boundary areas, when coping with complex scenes. This paper proposes a Progressive Region-to-Boundary Exploration Network (PRBE-Net) to accurately detect camouflaged objects. PRBE-Net follows an encoder-decoder framework and includes three key modules. Firstly, both high-level and low-level features of the encoder are integrated by a region and boundary exploration module that exploits their complementary information to extract the object's coarse region and fine boundary cues simultaneously. Secondly, taking the region cues as guidance, a Region Enhancement (RE) module adaptively localizes and enhances the region information at each layer of the encoder. Finally, considering that camouflaged objects usually have blurry boundaries, a Boundary Refinement (BR) decoder is applied after the RE module to better detect boundary areas with the assistance of the boundary cues. Through top-down deep supervision, PRBE-Net progressively refines the prediction. Extensive experiments on four datasets indicate that PRBE-Net achieves superior results over 21 state-of-the-art COD methods. 
Additionally, it also shows good results on polyp segmentation, a COD-related task in the medical field.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"236-248"},"PeriodicalIF":8.4,"publicationDate":"2024-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993739","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Memory-Enhanced Confidence Calibration for Class-Incremental Unsupervised Domain Adaptation","authors":"Jiaping Yu;Muli Yang;Aming Wu;Cheng Deng","doi":"10.1109/TMM.2024.3521834","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521834","url":null,"abstract":"In this paper, we focus on Class-Incremental Unsupervised Domain Adaptation (CI-UDA), where the labeled source domain already includes all classes, while the classes in the unlabeled target domain emerge sequentially over time. This task involves two main challenges. The first is the domain gap between the labeled source data and the unlabeled target data, which leads to weak generalization performance. The second is the inconsistency between the source and target category spaces at each time step, which causes catastrophic forgetting during the testing stage. Previous methods focus solely on aligning similar samples from different domains, overlooking the underlying causes of the domain gap and the class distribution difference. To tackle these issues, we rethink this task from a causal perspective for the first time. We first build a structural causal graph to describe the CI-UDA problem. Based on the causal graph, we present Memory-Enhanced Confidence Calibration (MECC), which aims to improve confidence in the predicted results. In particular, we argue that the domain discrepancy caused by different styles tends to make the model produce less confident predictions, which weakens its generalization and continual learning abilities. To this end, we first explore using the Gram matrix to generate source-style target data, which is combined with the original data to jointly train the model and thereby reduce the impact of domain shift. Second, we utilize the model from the previous time step to select corresponding samples for building a memory bank, which is instrumental in alleviating catastrophic forgetting. 
Extensive experimental results on multiple datasets demonstrate the superiority of our method.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"610-621"},"PeriodicalIF":8.4,"publicationDate":"2024-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143464353","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Masked Attribute Description Embedding for Cloth-Changing Person Re-Identification","authors":"Chunlei Peng;Boyu Wang;Decheng Liu;Nannan Wang;Ruimin Hu;Xinbo Gao","doi":"10.1109/TMM.2024.3521730","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521730","url":null,"abstract":"Cloth-changing person re-identification (CC-ReID) aims to match persons who change clothes over long periods. The key challenge in CC-ReID is to extract cloth-irrelevant features, such as face, hairstyle, body shape, and gait. Current research mainly focuses on modeling body shape using multi-modal biological features (such as silhouettes and sketches), but it does not fully leverage the personal description information hidden in the original RGB image. Considering that certain attribute descriptions remain unchanged after a change of clothes, we propose a Masked Attribute Description Embedding (MADE) method that unifies personal visual appearance and attribute description for CC-ReID. Variable cloth-sensitive information, such as color and type, is difficult to model effectively. To address this, we mask the clothes type and color information (upper body type, upper body color, lower body type, and lower body color) in the personal attribute description extracted by an attribute detection model. The masked attribute description is then embedded into Transformer blocks at various levels, fusing it with the low-level to high-level features of the image. This approach compels the model to discard cloth information. Experiments are conducted on several CC-ReID benchmarks, including PRCC, LTCC, Celeb-reID-light, and LaST. 
Results demonstrate that MADE effectively utilizes attribute description, enhancing cloth-changing person re-identification performance, and compares favorably with state-of-the-art methods.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"1475-1485"},"PeriodicalIF":8.4,"publicationDate":"2024-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143583153","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Between/Within View Information Completing for Tensorial Incomplete Multi-View Clustering","authors":"Mingze Yao;Huibing Wang;Yawei Chen;Xianping Fu","doi":"10.1109/TMM.2024.3521771","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521771","url":null,"abstract":"Incomplete Multi-view Clustering (IMvC) has received increasing attention due to its effectiveness in solving data-missing problems. Given the information loss in incomplete situations, the core of IMvC is to effectively overcome the challenge of missing views, that is, to explore the underlying correlations in the available data and recover the missing information. However, most existing IMvC methods overemphasize a recovery-first principle, integrating the existing data from different views while neglecting both view consistency and valuable within-view information. In this paper, we propose a novel Between/Within View Information Completing method for Tensorial Incomplete Multi-view Clustering (BWIC-TIMC), in which between-view and within-view information is jointly exploited to effectively complete the missing views. Specifically, the proposed method designs a dual tensor constraint module that simultaneously explores the view-specific correlations of incomplete views and enforces consistency across different views. With the dual tensor constraint, between-view and within-view information can be effectively integrated to complete the missing views for the IMvC task. Furthermore, to balance the contributions of multiple views and alleviate the problem of feature degeneration, BWIC-TIMC implements an adaptive fusion graph learning strategy for consensus representation learning. 
Extensive comparative experiments with state-of-the-art baselines demonstrate the effectiveness of BWIC-TIMC.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"1538-1550"},"PeriodicalIF":8.4,"publicationDate":"2024-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143583232","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DCM-Net: A Diffusion Model-Based Detection Network Integrating the Characteristics of Copy-Move Forgery","authors":"Shaowei Weng;Jianhao Zhang;Tanguo Zhu;Lifang Yu;Chunyu Zhang","doi":"10.1109/TMM.2024.3521685","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521685","url":null,"abstract":"Directly applying a generic object detection network to copy-move forgery detection (CMFD) inevitably leads to low detection accuracy. Therefore, this paper proposes DCM-Net, a diffusion model-based object detection network that incorporates the characteristics of copy-move forgery to substantially enhance CMFD performance. DCM-Net, the first diffusion model-based CMFD network, makes three improvements. Firstly, a high-similarity box padding strategy pads the ground truth boxes with high-similarity boxes, rather than the random boxes used in standard diffusion models, better guiding the subsequent dual-attention detection heads (DDHs) to focus on high-similarity regions. Secondly, unlike previous deep learning-based CMFD networks that use self-correlation calculation to indiscriminately transform all classification features from feature extraction into high-similarity features, an adaptive feature combination strategy is proposed to obtain the feature transformation that achieves the best detection performance, enabling the DDHs to distinguish source and target regions more effectively. Finally, to localize and distinguish source and target regions more accurately, the DDHs are equipped with efficient multi-scale attention and a contextual transformer to generate tampered features that fuse precise spatial position information with rich contextual global information. 
Experimental results on three publicly available datasets (USC-ISI, CoMoFoD, and COVERAGE) demonstrate that DCM-Net outperforms several advanced algorithms in both similarity detection and source/target differentiation.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"503-514"},"PeriodicalIF":8.4,"publicationDate":"2024-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143465790","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MSDLF-K: A Multimodal Feature Learning Approach for Sentiment Analysis in Korean Incorporating Text and Speech","authors":"Tae-Young Kim;Jufeng Yang;Eunil Park","doi":"10.1109/TMM.2024.3521707","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521707","url":null,"abstract":"Recently, sentiment analysis research has made significant improvements in addressing sentiment and subjectivity within textual content. The advent of multimodal deep learning techniques has further broadened this scope, enabling the integration of diverse modalities such as voice and image features alongside text. However, despite these advancements, the analysis of the Korean language remains challenging due to its inherently agglutinative nature and linguistic ambiguity, primarily examined at the sentence level. To effectively address this challenge, we propose a novel Multimodal Sentimental Deep Learning Framework for Korean (MSDLF-K), which can examine not only Korean text but also its associated speech. Our framework, MSDLF-K, integrates spectrograms and waveforms from Korean voice data with embedding vectors derived from script sentences, creating a unified multimodal representation. This approach facilitates the identification of both shared and unique features within the latent space, thereby offering valuable insights into their respective impacts on sentiment analysis performance. To validate the efficacy of MSDLF-K, we conducted a set of experiments using the emotion speech synthesis dataset. Our findings demonstrate that MSDLF-K achieves a remarkable accuracy of 79.0% in valence and 81.7% in arousal for emotion classification, metrics previously unexplored in the literature. Furthermore, empirical analysis reveals the significant influence of multimodal representations, encompassing both text and voice, on enhancing emotion analysis performance. 
In summary, our study not only presents a pioneering solution for sentiment analysis in the Korean language but also underscores the importance of incorporating multimodal approaches for more comprehensive and accurate sentiment analysis across diverse linguistic contexts.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"1266-1276"},"PeriodicalIF":8.4,"publicationDate":"2024-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143594412","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PRA-Det: Anchor-Free Oriented Object Detection With Polar Radius Representation","authors":"Min Dang;Gang Liu;Hao Li;Di Wang;Rong Pan;Quan Wang","doi":"10.1109/TMM.2024.3521683","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521683","url":null,"abstract":"Oriented object detection typically adds a rotation angle to the regressed horizontal bounding box (HBB) to represent the oriented bounding box (OBB). However, existing oriented object detectors based on angle regression suffer from inconsistency between the metric and the loss, boundary discontinuity, or square-like problems. To solve these problems, we propose an anchor-free oriented object detector named PRA-Det, which uses the center region of the object to regress OBBs represented by polar radius vectors. Specifically, PRA-Det introduces a diamond-shaped positive region with a category-wise attention factor to assign the positive sample points that regress the polar radius vectors. PRA-Det regresses the polar radius vectors of the edges from the assigned sample points as the regression target and suppresses predicted low-quality polar radius vectors through the category-wise attention factor. OBBs defined under different protocols are uniformly encoded by a polar radius encoding module into regression targets represented by polar radius vectors. Therefore, the regression target carries no angle parameters during training, which avoids the angle-sensitive boundary discontinuity and square-like problems. To optimize the predicted polar radius vectors, we design a spatial geometry loss that improves detection accuracy. Furthermore, in the inference stage, the center offset score of the polar radius vector is combined with the classification score as the confidence, alleviating the inconsistency between classification and regression. 
The extensive experiments on public benchmarks demonstrate that the PRA-Det is highly competitive with state-of-the-art oriented object detectors and outperforms other comparison methods.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"145-157"},"PeriodicalIF":8.4,"publicationDate":"2024-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993577","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Quality-Guided Skin Tone Enhancement for Portrait Photography","authors":"Shiqi Gao;Huiyu Duan;Xinyue Li;Kang Fu;Yicong Peng;Qihang Xu;Yuanyuan Chang;Jia Wang;Xiongkuo Min;Guangtao Zhai","doi":"10.1109/TMM.2024.3521829","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521829","url":null,"abstract":"In recent years, learning-based color and tone enhancement methods for photos have become increasingly popular. However, most learning-based image enhancement methods simply learn a mapping from one distribution to another based on a single dataset, lacking the ability to adjust images continuously and controllably. Enabling learning-based enhancement models to adjust an image continuously is important, since in many cases we may want a weaker or stronger enhancement effect rather than one fixed adjusted result. In this paper, we propose a quality-guided image enhancement paradigm that enables image enhancement models to learn the distribution of images with various quality ratings. By learning this distribution, image enhancement models can associate image features with their corresponding perceptual qualities, which can be used to adjust images continuously according to different quality scores. To validate the effectiveness of the proposed method, a subjective quality assessment experiment is first conducted, focusing on skin tone adjustment in portrait photography. Guided by the subjective quality ratings obtained from this experiment, our method can adjust the skin tone to meet different quality requirements. 
Furthermore, an experiment conducted on 10 natural raw images corroborates the effectiveness of our model in situations with fewer subjects and fewer shots, and also demonstrates its general applicability to natural images.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"171-185"},"PeriodicalIF":8.4,"publicationDate":"2024-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993737","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}