Yanyang Xiao, Tieyi Zhang, Juan Cao, Zhonggui Chen
{"title":"Accelerated Lloyd's Method for Resampling 3D Point Clouds","authors":"Yanyang Xiao, Tieyi Zhang, Juan Cao, Zhonggui Chen","doi":"10.1109/tmm.2024.3405664","DOIUrl":"https://doi.org/10.1109/tmm.2024.3405664","url":null,"abstract":"","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"2016 1","pages":""},"PeriodicalIF":7.3,"publicationDate":"2024-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141170528","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Bipartite Graph-Based Projected Clustering With Local Region Guidance for Hyperspectral Imagery","authors":"Yongshan Zhang;Guozhu Jiang;Zhihua Cai;Yicong Zhou","doi":"10.1109/TMM.2024.3394975","DOIUrl":"10.1109/TMM.2024.3394975","url":null,"abstract":"Hyperspectral image (HSI) clustering, which divides all pixels into different clusters, is challenging because of absent labels, large spectral variability, and complex spatial distribution. Anchor strategies provide an attractive solution to the computational bottleneck of graph-based clustering for large HSIs. However, most existing methods require separate learning procedures and ignore noise as well as spatial information. In this paper, we propose a bipartite graph-based projected clustering (BGPC) method with local region guidance for HSI data. To take full advantage of spatial information, HSI denoising to alleviate noise interference and anchor initialization to construct the bipartite graph are conducted within each generated superpixel. With the denoised pixels and initial anchors, projection learning and structured bipartite graph learning are simultaneously performed in a one-step learning model with a connectivity constraint to directly provide clustering results. An alternating optimization algorithm is devised to solve the formulated model. The advantage of BGPC is the joint learning of projection and bipartite graph with local region guidance to exploit spatial information, together with linear time complexity to lessen the computational burden. 
Extensive experiments demonstrate the superiority of the proposed BGPC over the state-of-the-art HSI clustering methods.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"9551-9563"},"PeriodicalIF":8.4,"publicationDate":"2024-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140831409","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Each Performs Its Functions: Task Decomposition and Feature Assignment for Audio-Visual Segmentation","authors":"Sen Xu;Shikui Wei;Tao Ruan;Lixin Liao;Yao Zhao","doi":"10.1109/TMM.2024.3394682","DOIUrl":"10.1109/TMM.2024.3394682","url":null,"abstract":"Audio-visual segmentation (AVS) aims to segment the object instances that produce sound at the time of the video frames. Existing related solutions focus on designing cross-modal interaction mechanisms, which try to learn audio-visual correlations and simultaneously segment objects. Despite effectiveness, the close-coupling network structures become increasingly complex and hard to analyze. To address these problems, we propose a simple but effective method, ‘Each Performs Its Functions (PIF),’ which focuses on task decomposition and feature assignment. Inspired by human sensory experiences, PIF decouples AVS into two subtasks, correlation learning, and segmentation refinement, via two branches. Correlation learning aims to learn the correspondence between sound and visible individuals and provide the positional prior. Segmentation refinement focuses on fine segmentation. Then we assign different level features to perform the appropriate duties, i.e., using deep features for cross-modal interaction due to their semantic advantages; using rich textures of shallow features to improve segmentation results. Moreover, we propose the recurrent collaboration block to enhance interbranch communication. Experimental results on AVSBench show that our method outperforms related state-of-the-art methods by a large margin (e.g., +6.0% mIoU and +7.6% F-score on the Multi-Source subset). 
In addition, by purposely boosting subtasks' performance, our approach can serve as a strong baseline for audio-visual segmentation.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"9489-9498"},"PeriodicalIF":8.4,"publicationDate":"2024-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140831411","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Lin Zhang;Yifan Wang;Ran Song;Mingxin Zhang;Xiaolei Li;Wei Zhang
{"title":"Neighborhood-Aware Mutual Information Maximization for Source-Free Domain Adaptation","authors":"Lin Zhang;Yifan Wang;Ran Song;Mingxin Zhang;Xiaolei Li;Wei Zhang","doi":"10.1109/TMM.2024.3394971","DOIUrl":"10.1109/TMM.2024.3394971","url":null,"abstract":"Recently, the source-free domain adaptation (SFDA) problem has attracted much attention, where the pre-trained model for the source domain is adapted to the target domain in the absence of source data. However, due to domain shift, negative alignment usually exists between samples from the same class, which may lower intra-class feature similarity. To address this issue, we present a self-supervised representation learning strategy for SFDA, named neighborhood-aware mutual information (NAMI), which maximizes the mutual information (MI) between the representations of target samples and their corresponding neighbors. Moreover, we theoretically demonstrate that NAMI can be decomposed into a weighted sum of local MI, which suggests that the weighted terms can better estimate NAMI. To this end, we introduce a neighborhood consensus score over the set of weakly and strongly augmented views and a point-wise density based on the neighborhood, both of which determine the weights of local MI for NAMI by leveraging the neighborhood information of samples. The proposed method can effectively handle domain shift and adaptively reduce the noise in the neighborhood of each target sample. 
In combination with the consistency loss over views, NAMI leads to consistent improvement over existing state-of-the-art methods on three popular SFDA benchmarks.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"9564-9574"},"PeriodicalIF":8.4,"publicationDate":"2024-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140831488","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Semantic-Enhanced Proxy-Guided Hashing for Long-Tailed Image Retrieval","authors":"Hongtao Xie;Yan Jiang;Lei Zhang;Pandeng Li;Dongming Zhang;Yongdong Zhang","doi":"10.1109/TMM.2024.3394684","DOIUrl":"10.1109/TMM.2024.3394684","url":null,"abstract":"Hashing has been studied extensively for large-scale image retrieval due to its efficient computation and storage. Deep hashing methods typically train models with category-balanced data and suffer from a serious performance deterioration when dealing with long-tailed training samples. Recently, several long-tailed hashing methods have focused on this newly emerging field for practical purposes. However, existing methods still face the challenge that fixed category centers with limited semantic information cannot effectively improve the discriminative ability of tail-category hash codes. To tackle this issue, we propose a novel method called Semantic-enhanced Proxy-guided Hashing in this paper. We leverage two sets of learnable category proxies in the feature space and the Hamming space respectively, which can describe category semantics by getting updated continuously along with the whole model via back-propagation. Based on this, we introduce the Mahalanobis distance metric to characterize relationships accurately and enhance the semantic representation of both proxies and samples concurrently, improving the hash learning process. Moreover, we capture the multilateral correlations between proxies and samples in the feature space and extend a hypergraph neural network to transfer semantic knowledge from proxies to samples in the Hamming space. 
Extensive experiments show that our method achieves the state-of-the-art performance and surpasses existing methods by 1.47%–7.56% MAP on long-tailed benchmarks, demonstrating the superiority of learnable category proxies and the effectiveness of our proposed learning algorithm for long-tailed hashing.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"9499-9514"},"PeriodicalIF":8.4,"publicationDate":"2024-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140831397","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Single-Shot and Multi-Shot Feature Learning for Multi-Object Tracking","authors":"Yizhe Li;Sanping Zhou;Zheng Qin;Le Wang;Jinjun Wang;Nanning Zheng","doi":"10.1109/TMM.2024.3394683","DOIUrl":"10.1109/TMM.2024.3394683","url":null,"abstract":"Multi-Object Tracking (MOT) remains a vital component of intelligent video analysis, which aims to locate targets and maintain a consistent identity for each target throughout a video sequence. Existing works usually learn a discriminative feature representation, such as motion and appearance, to associate the detections across frames, which are easily affected by mutual occlusion and background clutter in practice. In this paper, we propose a simple yet effective two-stage feature learning paradigm to jointly learn single-shot and multi-shot features for different targets, so as to achieve robust data association in the tracking process. For detections that have not been associated, we design a novel single-shot feature learning module to extract discriminative features of each detection, which can efficiently associate targets between adjacent frames. For tracklets that have been lost for several frames, we design a novel multi-shot feature learning module to extract discriminative features of each tracklet, which can accurately re-find these lost targets after a long period. Once equipped with a simple data association logic, the resulting VisualTracker can perform robust MOT based on the single-shot and multi-shot feature representations. 
Extensive experimental results demonstrate that our method has achieved significant improvements on MOT17 and MOT20 datasets while reaching state-of-the-art performance on DanceTrack dataset.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"9515-9526"},"PeriodicalIF":8.4,"publicationDate":"2024-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140831485","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ruiheng Zhang;Jinyu Tan;Zhe Cao;Lixin Xu;Yumeng Liu;Lingyu Si;Fuchun Sun
{"title":"Part-Aware Correlation Networks for Few-Shot Learning","authors":"Ruiheng Zhang;Jinyu Tan;Zhe Cao;Lixin Xu;Yumeng Liu;Lingyu Si;Fuchun Sun","doi":"10.1109/TMM.2024.3394681","DOIUrl":"10.1109/TMM.2024.3394681","url":null,"abstract":"Few-shot learning brings the machine close to human thinking which enables fast learning with limited samples. Recent work considers local features to achieve contextual semantic complementation, while they are merely coarsened feature observations that can only extract insignificant label correlations. On the contrary, partial properties of few-shot examples significantly draw the implicit feature observations that can reveal the underlying label correlation of rare label classification. To fully explore the correlation between labels and partial features, this paper proposes a Part-Aware Correlation Network (PACNet) based on Partial Representation (PR) and Semantic Covariance Matrix (SCM). Specifically, we develop a partial representing module of an object that eliminates object-independent information and allows the model to focus on more distinctive parts. Furthermore, a semantic covariance measure function is redefined as a way to learn the semantic relationships of partial representations and to compute the partial similarity between the query sample and the support set. 
Experiments on three benchmark datasets consistently show that the proposed method outperforms the state-of-the-art counterparts; e.g., on the PartImageNet dataset, performance gains of up to 12% and 5.9% are observed for the 5-way 1-shot and 5-way 5-shot settings, respectively.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"9527-9538"},"PeriodicalIF":8.4,"publicationDate":"2024-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140842032","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Domain-Oriented Knowledge Transfer for Cross-Domain Recommendation","authors":"Guoshuai Zhao;Xiaolong Zhang;Hao Tang;Jialie Shen;Xueming Qian","doi":"10.1109/TMM.2024.3394686","DOIUrl":"10.1109/TMM.2024.3394686","url":null,"abstract":"Cross-Domain Recommendation (CDR) aims to alleviate the cold-start problem by transferring knowledge from a data-rich domain (source domain) to a data-sparse domain (target domain), where knowledge needs to be transferred through a bridge connecting the two domains. Therefore, constructing a bridge connecting the two domains is fundamental for enabling cross-domain recommendation. However, existing CDR methods often overlook the value of natural relationships between items in connecting the two domains. To address this issue, we propose DKTCDR: a Domain-oriented Knowledge Transfer method for Cross-Domain Recommendation. In DKTCDR, we leverage the rich relationships between items in a cross-domain knowledge graph as bridges to facilitate both intra- and inter-domain knowledge transfer. Additionally, we design a cross-domain knowledge transfer strategy to enhance inter-domain knowledge transfer. Furthermore, we integrate the semantic modality information of items with the knowledge graph modality information to enhance item modeling. To support our investigation, we construct two high-quality cross-domain recommendation datasets, each containing a cross-domain knowledge graph. Our experimental results on these datasets validate the effectiveness of our proposed method. 
Source code is available at https://github.com/zxxxl123/DKTCDR.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"9539-9550"},"PeriodicalIF":8.4,"publicationDate":"2024-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140831423","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Group Multi-View Transformer for 3D Shape Analysis With Spatial Encoding","authors":"Lixiang Xu;Qingzhe Cui;Richang Hong;Wei Xu;Enhong Chen;Xin Yuan;Chenglong Li;Yuanyan Tang","doi":"10.1109/TMM.2024.3394731","DOIUrl":"10.1109/TMM.2024.3394731","url":null,"abstract":"In recent years, the results of view-based 3D shape recognition methods have saturated, and models with excellent performance cannot be deployed on memory-limited devices due to their huge size of parameters. To address this problem, we introduce a compression method based on knowledge distillation for this field, which largely reduces the number of parameters while preserving model performance as much as possible. Specifically, to enhance the capabilities of smaller models, we design a high-performing large model called Group Multi-view Vision Transformer (GMViT). In GMViT, the view-level ViT first establishes relationships between view-level features. Additionally, to capture deeper features, we employ the grouping module to enhance view-level features into group-level features. Finally, the group-level ViT aggregates group-level features into complete, well-formed 3D shape descriptors. Notably, in both ViTs, we introduce spatial encoding of camera coordinates as innovative position embeddings. Furthermore, we propose two compressed versions based on GMViT, namely GMViT-simple and GMViT-mini. To enhance the training effectiveness of the small models, we introduce a knowledge distillation method throughout the GMViT process, where the key outputs of each GMViT component serve as distillation targets. Extensive experiments demonstrate the efficacy of the proposed method. The large model GMViT achieves excellent 3D classification and retrieval results on the benchmark datasets ModelNet, ShapeNetCore55, and MCB. 
The smaller models, GMViT-simple and GMViT-mini, reduce the parameter size by 8 and 17.6 times, respectively, and improve shape recognition speed by 1.5 times on average, while preserving at least 90% of the recognition performance.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"9450-9463"},"PeriodicalIF":8.4,"publicationDate":"2024-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140831405","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hong Ding;Haimin Zhang;Gang Fu;Caoqing Jiang;Fei Luo;Chunxia Xiao;Min Xu
{"title":"Towards High-Quality Photorealistic Image Style Transfer","authors":"Hong Ding;Haimin Zhang;Gang Fu;Caoqing Jiang;Fei Luo;Chunxia Xiao;Min Xu","doi":"10.1109/TMM.2024.3394733","DOIUrl":"10.1109/TMM.2024.3394733","url":null,"abstract":"Preserving important textures of the content image and achieving prominent style transfer results remains a challenge in the field of image style transfer. This challenge arises from the entanglement between color and texture during the style transfer process. To address this challenge, we propose an end-to-end network that incorporates an adaptive weighted least squares (AWLS) filter, an iterative least squares (ILS) filter, and channel separation. Given a content image ($\\mathcal{C}$) and a reference style image ($\\mathcal{S}$), we begin by separating the RGB channels and utilizing the ILS filter to decompose them into structure and texture layers. We then perform style transfer on the structural layers using WCT$^{2}$ (incorporating wavelet pooling and unpooling techniques for whitening and coloring transforms) in the R, G, and B channels, respectively. We address the texture distortion caused by WCT$^{2}$ with a texture enhancing (TE) module in the structural layer. Furthermore, we propose an estimating and compensating for the structure loss (ECSL) module. In the ECSL module, with the AWLS filter and the ILS filter, we estimate the texture loss caused by TE, convert the loss of the structural layer to the loss of the texture layer, and compensate for the loss in the texture layer. The final structural layer and the texture layer are merged into per-channel style transfer results for the separated R, G, and B channels, which are then combined into the final style transfer result. 
This enables more complete texture preservation and a more pronounced style transfer. To evaluate our method, we conduct quantitative experiments using various metrics, including NIQE, AG, SSIM, and PSNR, as well as a user study. The experimental results demonstrate the superiority of our approach over the previous state-of-the-art methods.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"9892-9905"},"PeriodicalIF":8.4,"publicationDate":"2024-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140831559","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}