{"title":"PMMTalk$:$ Speech-Driven 3D Facial Animation From Complementary Pseudo Multi-Modal Features","authors":"Tianshun Han;Shengnan Gui;Yiqing Huang;Baihui Li;Lijian Liu;Benjia Zhou;Ning Jiang;Quan Lu;Ruicong Zhi;Yanyan Liang;Du Zhang;Jun Wan","doi":"10.1109/TMM.2024.3521701","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521701","url":null,"abstract":"Speech-driven 3D facial animation has improved a lot recently while most related works only utilize acoustic modality and neglect the influence of visual and textual cues, leading to unsatisfactory results in terms of precision and coherence. We argue that visual and textual cues are not trivial information. Therefore, we present a novel framework, namely PMMTalk, using complementary <bold>P</b>seudo <bold>M</b>ulti-<bold>M</b>odal features for improving the accuracy of facial animation. The framework entails three modules: PMMTalk encoder, cross-modal alignment module, and PMMTalk decoder. Specifically, the PMMTalk encoder employs the off-the-shelf talking head generation architecture and speech recognition technology to extract visual and textual information from speech, respectively. Following this, the cross-modal alignment module aligns the audio-image-text features at temporal and semantic levels. Subsequently, the PMMTalk decoder is employed to predict lip-syncing facial blendshape coefficients. Contrary to prior methods, PMMTalk only requires an additional random reference face image but yields more accurate results. Additionally, it is artist-friendly as it seamlessly integrates into standard animation production workflows by introducing facial blendshape coefficients. Finally, given the scarcity of 3D talking face datasets, we introduce a large-scale <bold>3D</b> <bold>C</b>hinese <bold>A</b>udio-<bold>V</b>isual <bold>F</b>acial <bold>A</b>nimation (3D-CAVFA) dataset. Extensive experiments and user studies show that our approach outperforms the state of the art. Codes and datasets are available at PMMTalk.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"2570-2581"},"PeriodicalIF":8.4,"publicationDate":"2024-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143949179","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SkyML: A MLaaS Federation Design for Multicloud-Based Multimedia Analytics","authors":"Shuzhao Xie;Yuan Xue;Yifei Zhu;Zhi Wang","doi":"10.1109/TMM.2024.3521768","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521768","url":null,"abstract":"The advent of deep learning has precipitated a surge in public machine learning as a service (MLaaS) for multimedia analysis. However, reliance on a single MLaaS can result in product dependency and a loss of better performance offered by multiple MLaaSes. Consequently, many enterprises opt for an intercloud broker capable of managing jobs across various clouds. Though existing works explore the efficient utilization of inter-cloud computational resources and the enhancement of inter-cloud data transfer throughput, they disregard improving the overall accuracy of multiple MLaaSes. In response, we conduct a measurement study on object detection services, which are designed to identify and locate various objects within an image. We discover that combining predictions from multiple MLaaSes can improve analytical performance. However, more MLaaSes do not necessarily equate to better performance. Therefore, we propose SkyML, a user-side MLaaS federation broker that selects a subset of MLaaSes based on the characteristics of the request to achieve optimal multimedia analytical performance. Initially, we design a combinatorial reinforcement learning approach to select the sound MLaaS combination, thereby maximizing user experience. We also present an ingenious, automated taxonomy unification algorithm to minimize human efforts in merging MLaaS-specific labels into a user-preferred label space. Moreover, we devise an optimized ensemble strategy to aggregate predictions from the selected MLaaSes. Evaluations indicate that our similarity-based taxonomy unification approach can reduce annotation costs by 90%. Moreover, real-world trace-driven evaluations further prove that our MLaaS selection method can achieve similar levels of accuracy with a 67% reduction in inference fees.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"2463-2476"},"PeriodicalIF":8.4,"publicationDate":"2024-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143949137","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hard-Sample Style Guided Patch Attack With RL-Enhanced Motion Pattern for Video Recognition","authors":"Jian Yang;Jun Li;Yunong Cai;Guoming Wu;Zhiping Shi;Chaodong Tan;Xianglong Liu","doi":"10.1109/TMM.2024.3521832","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521832","url":null,"abstract":"Adversarial attacks have been extensively studied in the image field. In recent years, research has shown that video recognition models are also vulnerable to adversarial examples. However, most studies about adversarial attacks for video models have focused on perturbation-based methods, while patch-based black-box attacks have received less attention. Despite the excellent performance of perturbation-based attacks, these attacks are impractical for real-world implementation. Most existing patch-based black-box attacks require occluding larger areas and performing more queries to the target model. In this paper, we propose a hard-sample style guided patch attack with reinforcement learning (RL) enhanced motion patterns for video recognition (HSPA). Specifically, we utilize the style features of video hard samples and transfer their multi-dimensional style features to images to obtain a texture patch set. Then we use reinforcement learning to locate the patch coordinates and obtain a specific adversarial motion pattern of the patch to successfully perform an effective attack on a video recognition model in both the spatial and temporal dimensions. Our experiments on three widely-used video action recognition models (C3D, LRCN, and TDN) and two mainstream datasets (UCF-101 and HMDB-51) demonstrate the superior performance of our method compared to other state-of-the-art approaches.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"1205-1215"},"PeriodicalIF":8.4,"publicationDate":"2024-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143594425","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"VLAB: Enhancing Video Language Pretraining by Feature Adapting and Blending","authors":"Xingjian He;Sihan Chen;Fan Ma;Zhicheng Huang;Xiaojie Jin;Zikang Liu;Dongmei Fu;Yi Yang;Jing Liu;Jiashi Feng","doi":"10.1109/TMM.2024.3521729","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521729","url":null,"abstract":"Large-scale image-text contrastive pre-training models, such as CLIP, have been demonstrated to effectively learn high-quality multimodal representations. However, there is limited research on learning video-text representations for general video multimodal tasks based on these powerful features. Towards this goal, we propose a novel video-text pre-training method dubbed VLAB: <bold>V</b>ideo <bold>L</b>anguage pre-training by feature <bold>A</b>dapting and <bold>B</b>lending, which transfers CLIP representations to video pre-training tasks and develops unified video multimodal models for a wide range of video-text tasks. Specifically, VLAB is founded on two key strategies: feature adapting and feature blending. In the former, we introduce a new video adapter module to address CLIP's deficiency in modeling temporal information and extend the model's capability to encompass both contrastive and generative tasks. In the latter, we propose an end-to-end training method that further enhances the model's performance by exploiting the complementarity of image and video features. We validate the effectiveness and versatility of VLAB through extensive experiments on highly competitive video multimodal tasks, including video text retrieval, video captioning, and video question answering. Remarkably, VLAB outperforms competing methods significantly and sets new records in video question answering on MSRVTT, MSVD, and TGIF datasets. It achieves an accuracy of 49.6, 60.9, and 79.0, respectively.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"2168-2180"},"PeriodicalIF":8.4,"publicationDate":"2024-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143800841","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"VB-KGN: Variational Bayesian Kernel Generation Networks for Motion Image Deblurring","authors":"Ying Fu;Xinyu Zhu;Xiaojie Li;Xin Wang;Xi Wu;Shu Hu;Yi Wu;Siwei Lyu;Wei Liu","doi":"10.1109/TMM.2024.3521805","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521805","url":null,"abstract":"Motion blur estimation is a critical and fundamental task in scene analysis and image restoration. While most state-of-the-art deep learning-based methods for single-image motion image deblurring focus on constructing deep networks or developing training strategies, the characterization of motion blur has received less attention. In this paper, we innovatively propose a non-parametric Variational Bayesian Kernel Generation Network (VB-KGN) for characterizing motion blur in a single image. To solve this model, we employ the variational inference framework to approximate the expected statistical distribution of motion blur images in a data-driven manner. The qualitative and quantitative evaluations of our experimental results demonstrate that our proposed model can generate highly accurate motion blur kernels, significantly improving motion image deblurring performance and substantially reducing the need for extensive training sample preprocessing for deblurring tasks.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"2028-2042"},"PeriodicalIF":8.4,"publicationDate":"2024-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143800876","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MGKsite: Multi-Modal Knowledge-Driven Site Selection via Intra and Inter-Modal Graph Fusion","authors":"Ke Liang;Lingyuan Meng;Hao Li;Meng Liu;Siwei Wang;Sihang Zhou;Xinwang Liu;Kunlun He","doi":"10.1109/TMM.2024.3521742","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521742","url":null,"abstract":"Site selection aims to select optimal locations for new stores, which is crucial in business management and urban computing. The early data-driven models heavily relied on feature engineering, which could not effectively model the complex relationships and diverse influences among different data. To alleviate such issues, the knowledge-driven paradigm is proposed based on urban knowledge graphs (KGs). However, the research on them is at an early stage. They omit extra multi-modal information corresponding to brands and stores due to two main challenges, i.e., (1) building available datasets, and (2) designing effective models. It constrains the expressive ability and practical value of previous models. To this end, we first construct new multi-modal urban KGs for site selection with three extra modal (i.e., visual, textual, and acoustic) attributes. Then, we propose a novel multi-modal knowledge-driven model (MGKsite). Concretely, a graph neural network (GNN) based fusion network is designed to fuse the features based on the attribute K-Nearest Neighbor (KNN) graph, which models both intra and inter-modal correlations among the features. The fused embeddings are further injected into the knowledge-driven backbones for learning and inference. Experiments prove promising capacities of MGKsite from five aspects, i.e., superiority, effectiveness, sensitivity, transferability and complexity.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"1722-1735"},"PeriodicalIF":8.4,"publicationDate":"2024-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143800871","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving Vision Anomaly Detection With the Guidance of Language Modality","authors":"Dong Chen;Kaihang Pan;Guangyu Dai;Guoming Wang;Yueting Zhuang;Siliang Tang;Mingliang Xu","doi":"10.1109/TMM.2024.3521813","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521813","url":null,"abstract":"Recent years have seen a surge of interest in anomaly detection. However, existing unsupervised anomaly detectors, particularly those for the vision modality, face significant challenges due to redundant information and sparse latent space. In contrast, anomaly detectors demonstrate superior performance in the language modality due to the unimodal nature of the data. This paper tackles the aforementioned challenges for vision modality from a multimodal point of view. Specifically, we propose Cross-modal Guidance (CMG), comprising of Cross-modal Entropy Reduction (CMER) and Cross-modal Linear Embedding (CMLE), to address the issues of redundant information and sparse latent space, respectively. CMER involves masking portions of the raw image and computing the matching score with the corresponding text. Essentially, CMER eliminates irrelevant pixels to direct the detector's focus towards critical content. To learn a more compact latent space for the vision anomaly detection, CMLE learns a correlation structure matrix from the language modality. Then, the acquired matrix compels the distribution of images to resemble that of texts in the latent space. Extensive experiments demonstrate the effectiveness of the proposed methods. Particularly, compared to the baseline that only utilizes images, the performance of CMG has been improved by 16.81%. Ablation experiments further confirm the synergy among the proposed CMER and CMLE, as each component depends on the other to achieve optimal performance.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"1410-1419"},"PeriodicalIF":8.4,"publicationDate":"2024-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143583252","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Focus Entirety and Perceive Environment for Arbitrary-Shaped Text Detection","authors":"Xu Han;Junyu Gao;Chuang Yang;Yuan Yuan;Qi Wang","doi":"10.1109/TMM.2024.3521797","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521797","url":null,"abstract":"Due to the diversity of scene text in aspects such as font, color, shape, and size, accurately and efficiently detecting text is still a formidable challenge. Among the various detection approaches, segmentation-based approaches have emerged as prominent contenders owing to their flexible pixel-level predictions. However, these methods typically model text instances in a bottom-up manner, which is highly susceptible to noise. In addition, the prediction of pixels is isolated without introducing pixel-feature interaction, which also influences the detection performance. To alleviate these problems, we propose a multi-information level arbitrary-shaped text detector consisting of a focus entirety module (FEM) and a perceive environment module (PEM). The former extracts instance-level features and adopts a top-down scheme to model texts to reduce the influence of noises. Specifically, it assigns consistent entirety information to pixels within the same instance to improve their cohesion. In addition, it emphasizes the scale information, enabling the model to distinguish varying scale texts effectively. The latter extracts region-level information and encourages the model to focus on the distribution of positive samples in the vicinity of a pixel, which perceives environment information. It treats the kernel pixels as positive samples and helps the model differentiate text and kernel features. Extensive experiments demonstrate the FEM's ability to efficiently support the model in handling different scale texts and confirm the PEM can assist in perceiving pixels more accurately by focusing on pixel vicinities. Comparisons show the proposed model outperforms existing state-of-the-art approaches on four public datasets.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"287-299"},"PeriodicalIF":8.4,"publicationDate":"2024-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993743","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Category-Level Multi-Object 9D State Tracking Using Object-Centric Multi-Scale Transformer in Point Cloud Stream","authors":"Jingtao Sun;Yaonan Wang;Mingtao Feng;Xiaofeng Guo;Huimin Lu;Xieyuanli Chen","doi":"10.1109/TMM.2024.3521664","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521664","url":null,"abstract":"Category-level object pose estimation and tracking has achieved impressive progress in computer vision, augmented reality, and robotics. Existing methods either estimate the object states from a single observation or only track the 6-DoF pose of a single object. In this paper, we focus on category-level multi-object 9-Dimensional (9D) state tracking from the point cloud stream. We propose a novel 9D state estimation network to estimate the 6-DoF pose and 3D size of each instance in the scene. It uses our devised multi-scale global attention and object-level local attention modules to obtain representative latent features to estimate the 9D state of each object in the current observation. We then integrate our network estimation into a Kalman filter to combine previous states with the current estimates and achieve multi-object 9D state tracking. Experiment results on two public datasets show that our method achieves state-of-the-art performance on both category-level multi-object state estimation and pose tracking tasks. Furthermore, we directly apply the pre-trained model of our method to our air-ground robot system with multiple moving objects. Experiments on our collected real-world dataset show our method's strong generalization ability and real-time pose tracking performance.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"1072-1085"},"PeriodicalIF":8.4,"publicationDate":"2024-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143465735","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Federated Hallucination Translation and Source-Free Regularization Adaptation in Decentralized Domain Adaptation for Foggy Scene Understanding","authors":"Xiating Jin;Jiajun Bu;Zhi Yu;Hui Zhang;Yaonan Wang","doi":"10.1109/TMM.2024.3521711","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521711","url":null,"abstract":"Semantic foggy scene understanding (SFSU) emerges a challenging task under out-of-domain distribution (OD) due to uncertain cognition caused by degraded visibility. With the strong assumption of data centralization, unsupervised domain adaptation (UDA) reduces vulnerability under OD scenario. Whereas, enlarged domain gap and growing privacy concern heavily challenge conventional UDA. Motivated by gap decomposition and data decentralization, we establish a decentralized domain adaptation (DDA) framework called <bold><u>T</u></b>ranslate th<bold><u>E</u></b>n <bold><u>A</u></b>dapt (abbr. <bold><u>TEA</u></b>) for privacy preservation. Our highlights lie in. (1) Regarding federated hallucination translation, a <bold><u>Dis</u></b>entanglement and <bold><u>Co</u></b>ntrastive-learning based <bold><u>G</u></b>enerative <bold><u>A</u></b>dversarial <bold><u>N</u></b>etwork (abbr. <bold><u>DisCoGAN</u></b>) is proposed to impose contrastive prior and disentangle latent space in cycle-consistent translation. To yield domain hallucination, client minimizes cross-entropy of local classifier but maximizes entropy of global model to train translator. (2) Regarding source-free regularization adaptation, a <bold><u>Pro</u></b>totypical-knowledge based <bold><u>R</u></b>egularization <bold><u>A</u></b>daptation (abbr. <bold><u>ProRA</u></b>) is presented to align joint distribution in output space. Soft adversarial learning relaxes binary label to rectify inter-domain discrepancy and inner-domain divergence. Structure clustering and entropy minimization drive intra-class features closer and inter-class features apart. Extensive experiments exhibit efficacy of our TEA which achieves 55.26% or 46.25% mIoU in adaptation from GTA5 to Foggy Cityscapes or Foggy Zurich, outperforming other DDA methods for SFSU.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"1601-1616"},"PeriodicalIF":8.4,"publicationDate":"2024-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143583226","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}