{"title":"Single-Shot and Multi-Shot Feature Learning for Multi-Object Tracking","authors":"Yizhe Li;Sanping Zhou;Zheng Qin;Le Wang;Jinjun Wang;Nanning Zheng","doi":"10.1109/TMM.2024.3394683","DOIUrl":"10.1109/TMM.2024.3394683","url":null,"abstract":"Multi-Object Tracking (MOT) remains a vital component of intelligent video analysis, which aims to locate targets and maintain a consistent identity for each target throughout a video sequence. Existing works usually learn a discriminative feature representation, such as motion and appearance, to associate the detections across frames, which are easily affected by mutual occlusion and background clutter in practice. In this paper, we propose a simple yet effective two-stage feature learning paradigm to jointly learn single-shot and multi-shot features for different targets, so as to achieve robust data association in the tracking process. For detections that have not yet been associated, we design a novel single-shot feature learning module to extract discriminative features of each detection, which can efficiently associate targets between adjacent frames. For tracklets that have been lost for several frames, we design a novel multi-shot feature learning module to extract discriminative features of each tracklet, which can accurately re-find these lost targets after a long period. Once equipped with a simple data association logic, the resulting VisualTracker can perform robust MOT based on the single-shot and multi-shot feature representations. 
Extensive experimental results demonstrate that our method has achieved significant improvements on MOT17 and MOT20 datasets while reaching state-of-the-art performance on DanceTrack dataset.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"9515-9526"},"PeriodicalIF":8.4,"publicationDate":"2024-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140831485","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Part-Aware Correlation Networks for Few-Shot Learning","authors":"Ruiheng Zhang;Jinyu Tan;Zhe Cao;Lixin Xu;Yumeng Liu;Lingyu Si;Fuchun Sun","doi":"10.1109/TMM.2024.3394681","DOIUrl":"10.1109/TMM.2024.3394681","url":null,"abstract":"Few-shot learning brings the machine close to human thinking which enables fast learning with limited samples. Recent work considers local features to achieve contextual semantic complementation, while they are merely coarsened feature observations that can only extract insignificant label correlations. On the contrary, partial properties of few-shot examples significantly draw the implicit feature observations that can reveal the underlying label correlation of rare label classification. To fully explore the correlation between labels and partial features, this paper proposes a Part-Aware Correlation Network (PACNet) based on Partial Representation (PR) and Semantic Covariance Matrix (SCM). Specifically, we develop a partial representing module of an object that eliminates object-independent information and allows the model to focus on more distinctive parts. Furthermore, a semantic covariance measure function is redefined as a way to learn the semantic relationships of partial representations and to compute the partial similarity between the query sample and the support set. 
Experiments on three benchmark datasets consistently show that the proposed method outperforms state-of-the-art counterparts; e.g., on the PartImageNet dataset, performance gains of up to 12% and 5.9% are observed for the 5-way 1-shot and 5-way 5-shot settings, respectively.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"9527-9538"},"PeriodicalIF":8.4,"publicationDate":"2024-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140842032","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Domain-Oriented Knowledge Transfer for Cross-Domain Recommendation","authors":"Guoshuai Zhao;Xiaolong Zhang;Hao Tang;Jialie Shen;Xueming Qian","doi":"10.1109/TMM.2024.3394686","DOIUrl":"10.1109/TMM.2024.3394686","url":null,"abstract":"Cross-Domain Recommendation (CDR) aims to alleviate the cold-start problem by transferring knowledge from a data-rich domain (source domain) to a data-sparse domain (target domain), where knowledge needs to be transferred through a bridge connecting the two domains. Therefore, constructing a bridge connecting the two domains is fundamental for enabling cross-domain recommendation. However, existing CDR methods often overlook the value of the natural relationships between items in connecting the two domains. To address this issue, we propose DKTCDR: a Domain-oriented Knowledge Transfer method for Cross-Domain Recommendation. In DKTCDR, we leverage the rich relationships between items in a cross-domain knowledge graph as bridges to facilitate both intra- and inter-domain knowledge transfer. Additionally, we design a cross-domain knowledge transfer strategy to enhance inter-domain knowledge transfer. Furthermore, we integrate the semantic modality information of items with the knowledge graph modality information to enhance item modeling. To support our investigation, we construct two high-quality cross-domain recommendation datasets, each containing a cross-domain knowledge graph. Our experimental results on these datasets validate the effectiveness of our proposed method. 
Source code is available at https://github.com/zxxxl123/DKTCDR.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"9539-9550"},"PeriodicalIF":8.4,"publicationDate":"2024-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140831423","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Group Multi-View Transformer for 3D Shape Analysis With Spatial Encoding","authors":"Lixiang Xu;Qingzhe Cui;Richang Hong;Wei Xu;Enhong Chen;Xin Yuan;Chenglong Li;Yuanyan Tang","doi":"10.1109/TMM.2024.3394731","DOIUrl":"10.1109/TMM.2024.3394731","url":null,"abstract":"In recent years, the results of view-based 3D shape recognition methods have saturated, and models with excellent performance cannot be deployed on memory-limited devices due to their huge size of parameters. To address this problem, we introduce a compression method based on knowledge distillation for this field, which largely reduces the number of parameters while preserving model performance as much as possible. Specifically, to enhance the capabilities of smaller models, we design a high-performing large model called Group Multi-view Vision Transformer (GMViT). In GMViT, the view-level ViT first establishes relationships between view-level features. Additionally, to capture deeper features, we employ the grouping module to enhance view-level features into group-level features. Finally, the group-level ViT aggregates group-level features into complete, well-formed 3D shape descriptors. Notably, in both ViTs, we introduce spatial encoding of camera coordinates as innovative position embeddings. Furthermore, we propose two compressed versions based on GMViT, namely GMViT-simple and GMViT-mini. To enhance the training effectiveness of the small models, we introduce a knowledge distillation method throughout the GMViT process, where the key outputs of each GMViT component serve as distillation targets. Extensive experiments demonstrate the efficacy of the proposed method. The large model GMViT achieves excellent 3D classification and retrieval results on the benchmark datasets ModelNet, ShapeNetCore55, and MCB. 
The smaller models, GMViT-simple and GMViT-mini, reduce the parameter size by 8 and 17.6 times, respectively, and improve shape recognition speed by 1.5 times on average, while preserving at least 90% of the recognition performance.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"9450-9463"},"PeriodicalIF":8.4,"publicationDate":"2024-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140831405","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards High-Quality Photorealistic Image Style Transfer","authors":"Hong Ding;Haimin Zhang;Gang Fu;Caoqing Jiang;Fei Luo;Chunxia Xiao;Min Xu","doi":"10.1109/TMM.2024.3394733","DOIUrl":"10.1109/TMM.2024.3394733","url":null,"abstract":"Preserving important textures of the content image and achieving prominent style transfer results remains a challenge in the field of image style transfer. This challenge arises from the entanglement between color and texture during the style transfer process. To address this challenge, we propose an end-to-end network that incorporates an adaptive weighted least squares (AWLS) filter, an iterative least squares (ILS) filter, and channel separation. Given a content image ($\\mathcal{C}$) and a reference style image ($\\mathcal{S}$), we begin by separating the RGB channels and utilizing the ILS filter to decompose them into structure and texture layers. We then perform style transfer on the structural layers using WCT$^{2}$ (incorporating wavelet pooling and unpooling techniques for whitening and coloring transforms) in the R, G, and B channels, respectively. We address the texture distortion caused by WCT$^{2}$ with a texture enhancing (TE) module in the structural layer. Furthermore, we propose an estimating and compensating for the structure loss (ECSL) module. In the ECSL module, with the AWLS filter and the ILS filter, we estimate the texture loss caused by TE, convert the loss of the structural layer to the loss of the texture layer, and compensate for the loss in the texture layer. The final structural layer and texture layer are merged into per-channel style transfer results for the separated R, G, and B channels, which are then combined into the final style transfer result. 
Thereby, this enables a more complete texture preservation and a significant style transfer process. To evaluate our method, we utilize quantitative experiments using various metrics, including NIQE, AG, SSIM, PSNR, and a user study. The experimental results demonstrate the superiority of our approach over the previous state-of-the-art methods.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"9892-9905"},"PeriodicalIF":8.4,"publicationDate":"2024-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140831559","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Live 360° Video Streaming to Heterogeneous Clients in 5G Networks","authors":"Jacob Chakareski;Mahmudur Khan","doi":"10.1109/TMM.2024.3382910","DOIUrl":"10.1109/TMM.2024.3382910","url":null,"abstract":"We investigate rate-distortion-computing optimized live 360° video streaming to heterogeneous mobile VR clients in 5G networks. The client population comprises devices that feature single (LTE) or dual (LTE/NR) cellular connectivity. The content is compressed using scalable 360° tiling at the origin and sent towards the clients over a single backbone network link. A mobile edge server then adapts the incoming streaming data to the individual clients and their respective down-link transmission rates using formal rate-distortion-computing optimization. Single connectivity clients are served by the edge server a baseline representation/layer of the content adapted to their down-link transmission capacity and device computing capability. A dual connectivity client is served in parallel a baseline content layer on its LTE connectivity and a complementary viewport-specific enhancement layer on its NR connectivity, synergistically adapted to the respective down-links' transmission capacities and its computing capability. We formulate two optimization problems to conduct the operation of the edge server in each case, taking into account the key system components of the delivery process and induced end-to-end latency, aiming to maximize the immersion fidelity delivered to each client. We explore respective geometric programming optimization strategies that compute the optimal solutions at lower complexity. We rigorously analyze the computational complexity of the two optimization algorithms we formulate. In our evaluation, we demonstrate considerable performance gains over multiple assessment factors relative to two state-of-the-art techniques. 
We also examine the robustness of our approach to inaccurate user navigation prediction, transient NR link loss, dynamic LTE bandwidth variations, and diverse 360° video content. Finally, we contrast our results over five popular video quality metrics. The paper makes a community contribution by publicly sharing a dataset that captures the rate-quality trade-offs of the 360° video content used in our evaluation, for multiple contemporary quality metrics, to stimulate further studies and follow up work.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"8860-8873"},"PeriodicalIF":8.4,"publicationDate":"2024-04-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140799990","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-Level Pixel-Wise Correspondence Learning for 6DoF Face Pose Estimation","authors":"Miao Xu;Xiangyu Zhu;Yueying Kao;Zhiwen Chen;Jiangjing Lyu;Zhen Lei","doi":"10.1109/TMM.2024.3391888","DOIUrl":"10.1109/TMM.2024.3391888","url":null,"abstract":"In this paper, we focus on estimating the six degrees of freedom (6DoF) pose of a face from a single RGB image, which is an important but under-investigated problem in 3D face applications such as face reconstruction, forgery detection and virtual try-on. This problem is different from traditional face pose estimation and 3D face reconstruction since the distance from camera to face should be estimated, which cannot be directly regressed due to the non-linearity of the pose space. To solve the problem, we follow Perspective-n-Point (PnP) and predict the correspondences between 3D points in canonical space and 2D facial pixels on the input image to solve the 6DoF pose parameters. In this framework, the central problem of 6DoF estimation is building the correspondence matrix between a set of sampled 2D pixels and 3D points, and we propose a Correspondence Learning Transformer (CLT) to achieve this goal. Specifically, we build the 2D and 3D features with local, global, and semantic information, and employ self-attention to make the 2D and 3D features interact with each other and build the 2D–3D correspondence. Besides, we argue that 6DoF estimation is related not only to the face appearance itself but also to the facial external context, which contains rich information about the distance to the camera. Therefore, we extract global-and-local features from the integration of face and context, where the cropped face image with smaller receptive fields concentrates on the small distortion by perspective projection, and the whole image with a large receptive field provides shoulder and environment information. 
Experiments show that our method achieves a 2.0% improvement of $MAE_{r}$ and $ADD$ on ARKitFace and a 4.0%/0.7% improvement of $MAE_{t}$ on ARKitFace/BIWI.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"9423-9435"},"PeriodicalIF":8.4,"publicationDate":"2024-04-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140634274","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Benchmark Dataset and Pair-Wise Ranking Method for Quality Evaluation of Night-Time Image Enhancement","authors":"Xuejin Wang;Leilei Huang;Hangwei Chen;Qiuping Jiang;Shaowei Weng;Feng Shao","doi":"10.1109/TMM.2024.3391907","DOIUrl":"10.1109/TMM.2024.3391907","url":null,"abstract":"Night-time image enhancement (NIE) aims at boosting the intensity of low-light regions while suppressing noise or light effects in night-time images, and numerous efforts have been made for this task. However, few explorations focus on the quality evaluation issue of enhanced night-time images (ENTIs), and how to fairly compare the performance of different NIE algorithms remains a challenging problem. In this paper, we first construct a new Real-world Night-Time Image Enhancement Quality Assessment (i.e., RNTIEQA) dataset that includes two typical types of night-time scenes (i.e., extremely low light and uneven light scenes), and carry out human subjective studies to compare the quality of ENTIs obtained by a set of representative NIE algorithms. Afterwards, a new objective ranking method that comprehensively considers image intrinsic and impairment attributes is proposed for automatically predicting the quality of ENTIs. Experimental results on our RNTIEQA dataset demonstrate that the proposed method outperforms the off-the-shelf competitors. 
Our dataset and code will be released at https://github.com/Leilei-Huang-work/RNTIEQA-dataset.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"9436-9449"},"PeriodicalIF":8.4,"publicationDate":"2024-04-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140637302","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"General Deformable RoI Pooling and Semi-Decoupled Head for Object Detection","authors":"Bo Han;Lihuo He;Ying Yu;Wen Lu;Xinbo Gao","doi":"10.1109/TMM.2024.3391899","DOIUrl":"10.1109/TMM.2024.3391899","url":null,"abstract":"Object detection aims to classify objects of interest within an image and pinpoint their positions using predicted rectangular bounding boxes. However, classification and localization tasks are heterogeneous, not only spatially misaligned but also differing in properties and feature requirements. Modern detectors commonly share the spatial region and detection head for both tasks, making it challenging to achieve optimal performance on both and resulting in inconsistent accuracy. Specifically, the predicted bounding box may have higher classification confidence but lower localization quality, or vice versa. To tackle this issue, the spatial decoupling mechanism via general deformable RoI pooling is first proposed. This mechanism separately pursues the favorable regions for classification and localization, and subsequently extracts the corresponding features. Then, the semi-decoupled head is designed. Compared to the decoupled head that utilizes independent classification and localization networks, potentially leading to excessive decoupling and compromised detection performance, the semi-decoupled head enables the networks to mutually enhance each other while concentrating on their respective tasks. In addition, the semi-decoupled head also introduces a redundancy suppression module to filter out redundant task-irrelevant information of features extracted by separate networks and reinforce task-related information. By combining the spatial decoupling mechanism with the semi-decoupled head, the proposed detector achieves an impressive 43.7 AP in the Faster R-CNN framework with ResNet-101 as the backbone network. 
Without bells and whistles, extensive experimental results on the popular MS COCO dataset demonstrate that the proposed detector surpasses the baseline by a significant margin and outperforms some state-of-the-art detectors.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"9410-9422"},"PeriodicalIF":8.4,"publicationDate":"2024-04-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140634510","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}