Latest Publications in IEEE Transactions on Multimedia

Disaggregation Distillation for Person Search
IF 8.4 | Q1, Computer Science
IEEE Transactions on Multimedia | Pub Date: 2024-12-30 | DOI: 10.1109/TMM.2024.3521732 | Vol. 27, pp. 158-170
Authors: Yizhen Jia; Rong Quan; Haiyan Chen; Jiamei Liu; Yichao Yan; Song Bai; Jie Qin
Abstract: Person search is a challenging task in computer vision and multimedia understanding, which aims at localizing and identifying target individuals in realistic scenes. State-of-the-art models achieve remarkable accuracy but suffer from heavy computation and inefficient inference, making them impractical in most real-world applications. A promising way to tackle this dilemma is to compress person search models with knowledge distillation (KD). Previous KD-based person search methods typically distill knowledge only from the re-identification (re-id) branch, completely overlooking the useful knowledge in the detection branch. In addition, we show that the imbalance between person and background regions in feature maps has a negative impact on the distillation process. To this end, we propose a novel KD-based approach, Disaggregation Distillation for Person Search (DDPS), which disaggregates the distillation process and the feature maps, respectively. First, the distillation process is disaggregated into two task-oriented sub-processes, i.e., detection distillation and re-id distillation, to help the student learn both accurate localization and discriminative person embeddings. Second, we disaggregate each feature map into person and background regions and distill the two regions independently to alleviate the imbalance problem. More concretely, three types of distillation modules, i.e., logit distillation (LD), correlation distillation (CD), and disaggregation feature distillation (DFD), are designed to transfer comprehensive information from the teacher to the student. This simple yet effective distillation scheme can be readily applied to both homogeneous and heterogeneous teacher-student combinations. We conduct extensive experiments on two person search benchmarks, where the results demonstrate that, surprisingly, DDPS enables the student model to surpass the corresponding teacher model, even achieving results comparable with general person search models.
An illustrative code sketch of region-disaggregated feature distillation follows this entry.
Citations: 0
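The abstract's key observation is that background pixels vastly outnumber person pixels in a feature map, so a single distillation loss lets the background dominate. The sketch below illustrates that idea only; it is not the DDPS code, and the box-to-mask construction, loss weights, and per-region normalization are assumptions made for illustration.

```python
# A minimal, hypothetical sketch of region-disaggregated feature distillation:
# teacher/student feature maps are split into person and background regions with a
# box-derived mask, and each region is distilled with its own normalised MSE term
# so the foreground is not swamped by the much larger background.
import torch

def box_mask(boxes, height, width):
    """Rasterise person boxes (x1, y1, x2, y2, in feature-map coords) to a {0,1} mask."""
    mask = torch.zeros(height, width)
    for x1, y1, x2, y2 in boxes.round().long().tolist():
        mask[max(y1, 0):min(y2, height), max(x1, 0):min(x2, width)] = 1.0
    return mask

def disaggregated_feat_distill(f_student, f_teacher, boxes, w_person=1.0, w_bg=0.5):
    """f_student, f_teacher: (C, H, W) feature maps already projected to the same C."""
    _, h, w = f_teacher.shape
    m = box_mask(boxes, h, w)                           # (H, W), 1 inside person boxes
    person, bg = m.unsqueeze(0), (1.0 - m).unsqueeze(0)
    diff = (f_student - f_teacher) ** 2
    # Normalise each term by its own region size so the background region does not
    # dominate the loss -- the imbalance the abstract refers to.
    loss_person = (diff * person).sum() / person.sum().clamp(min=1.0)
    loss_bg = (diff * bg).sum() / bg.sum().clamp(min=1.0)
    return w_person * loss_person + w_bg * loss_bg

# toy usage
f_t = torch.randn(256, 64, 32)
f_s = torch.randn(256, 64, 32)
boxes = torch.tensor([[4.0, 10.0, 12.0, 40.0]])
print(disaggregated_feat_distill(f_s, f_t, boxes))
```

Normalising each term by its own region size is the simplest way to keep the person term from being swamped; DDPS additionally distills logits and feature correlations, which are omitted here.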
Frequency-Assisted Mamba for Remote Sensing Image Super-Resolution
IF 8.4 | Q1, Computer Science
IEEE Transactions on Multimedia | Pub Date: 2024-12-30 | DOI: 10.1109/TMM.2024.3521798 | Vol. 27, pp. 1783-1796
Authors: Yi Xiao; Qiangqiang Yuan; Kui Jiang; Yuzeng Chen; Qiang Zhang; Chia-Wen Lin
Abstract: Recent progress in remote sensing image (RSI) super-resolution (SR) has exhibited remarkable performance using deep neural networks, e.g., convolutional neural networks and Transformers. However, existing SR methods often suffer from either a limited receptive field or quadratic computational overhead, resulting in sub-optimal global representation and unacceptable computational costs for large-scale RSI. To alleviate these issues, we develop the first attempt to integrate the Vision State Space Model (Mamba) for RSI-SR, which specializes in processing large-scale RSI by capturing long-range dependencies with linear complexity. To achieve better SR reconstruction, building upon Mamba, we devise a frequency-assisted Mamba framework, dubbed FMSR, to explore spatial and frequency correlations. In particular, FMSR features a multi-level fusion architecture equipped with a Frequency Selection Module (FSM), a Vision State Space Module (VSSM), and a Hybrid Gate Module (HGM) to combine their merits for effective spatial-frequency fusion. Considering that global and local dependencies are complementary and both beneficial for SR, we further recalibrate these multi-level features for accurate feature fusion via learnable scaling adaptors. Extensive experiments on the AID, DOTA, and DIOR benchmarks demonstrate that FMSR outperforms the state-of-the-art Transformer-based method HAT-L in PSNR by 0.11 dB on average, while requiring only 28.05% of its memory and 19.08% of its complexity.
An illustrative code sketch of a frequency-selection branch with learnable scaling follows this entry.
Citations: 0
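The abstract names a Frequency Selection Module and learnable scaling adaptors without detailing them, so the sketch below only illustrates the general recipe: a frequency branch that re-weights Fourier components of a feature map, recalibrated against the spatial branch by learnable scales. The 1x1-conv re-weighting, the two scalar scales, and the module names are assumptions for illustration, not the paper's design.

```python
# A hypothetical sketch (not the released FMSR code) of spatial-frequency fusion:
# re-weight the Fourier spectrum of a feature map with a learned 1x1 convolution,
# then fuse the result with the spatial branch via learnable per-branch scales.
import torch
import torch.nn as nn

class FrequencySelection(nn.Module):
    """Re-weight feature amplitudes in the Fourier domain with a learned 1x1 conv."""
    def __init__(self, channels):
        super().__init__()
        self.weight = nn.Conv2d(2 * channels, 2 * channels, kernel_size=1)

    def forward(self, x):                                   # x: (B, C, H, W)
        spec = torch.fft.rfft2(x, norm="ortho")             # (B, C, H, W//2+1), complex
        z = torch.cat([spec.real, spec.imag], dim=1)        # stack real/imag as channels
        z = self.weight(z)
        real, imag = z.chunk(2, dim=1)
        return torch.fft.irfft2(torch.complex(real, imag), s=x.shape[-2:], norm="ortho")

class ScaledFusion(nn.Module):
    """Fuse spatial and frequency branches with learnable per-branch scales."""
    def __init__(self, channels):
        super().__init__()
        self.freq = FrequencySelection(channels)
        self.alpha = nn.Parameter(torch.ones(1))    # spatial-branch scale
        self.beta = nn.Parameter(torch.ones(1))     # frequency-branch scale

    def forward(self, x_spatial):
        return self.alpha * x_spatial + self.beta * self.freq(x_spatial)

x = torch.randn(2, 32, 48, 48)
print(ScaledFusion(32)(x).shape)   # torch.Size([2, 32, 48, 48])
```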
IEEE Transactions on Multimedia Publication Information
IF 8.4 | Q1, Computer Science
IEEE Transactions on Multimedia | Pub Date: 2024-12-27 | DOI: 10.1109/TMM.2024.3444988 | Vol. 26, pp. C2-C2
Open-access PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10817140
Citations: 0
Structure-Aware Pre-Selected Neural Rendering for Light Field Reconstruction
IF 8.4 | Q1, Computer Science
IEEE Transactions on Multimedia | Pub Date: 2024-12-27 | DOI: 10.1109/TMM.2024.3521784 | Vol. 27, pp. 1574-1587
Authors: Song Chang; Youfang Lin; Shuo Zhang
Abstract: As densely-sampled light field (LF) images benefit many applications, LF reconstruction has become an important technology in related fields. Recently, neural rendering has shown great potential in reconstruction tasks. However, volume rendering in existing methods needs to sample many points along the whole camera ray or epipolar line, which is time-consuming. In this paper, specifically for LF images with regular angular sampling, we propose a novel structure-aware pre-selected neural rendering framework for LF reconstruction. Instead of sampling along the whole epipolar line, we sample at a few specific positions, which are estimated from the color and inherent scene-structure information available in regularly angular-sampled LF images. By sampling only a few points that closely match the target pixel, the feature of the target pixel is rendered quickly and with high quality. Finally, we fuse the features and decode them along the view dimension to obtain the target view. Experiments show that the proposed method outperforms state-of-the-art LF reconstruction methods in both qualitative and quantitative comparisons across various tasks, and also surpasses most existing methods in speed. Moreover, without any retraining or fine-tuning, our method without per-scene optimization even outperforms methods that use per-scene optimization.
An illustrative code sketch of disparity-guided pre-selected epipolar sampling follows this entry.
Citations: 0
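The central idea is to sample only a few epipolar positions pre-selected from scene structure rather than the whole line. The sketch below shows one way such pre-selected sampling could look for a horizontally displaced source view, assuming a coarse disparity map is available; the function name, the fixed offset set, and the use of grid_sample are illustrative assumptions, not the paper's implementation.

```python
# A hypothetical sketch of "pre-selected" sampling: instead of sampling densely along
# the whole epipolar line, gather source-view features only at a few candidate
# disparities around a coarse disparity estimate (horizontal baseline assumed here).
import torch
import torch.nn.functional as F

def preselected_epipolar_samples(src_feat, coarse_disp, baseline, offsets=(-1.0, 0.0, 1.0)):
    """
    src_feat:    (B, C, H, W) features of a source view.
    coarse_disp: (B, 1, H, W) coarse disparity of the target view (pixels per unit baseline).
    baseline:    angular distance between target and source view.
    Returns (B, C, K, H, W): features sampled at K pre-selected epipolar positions.
    """
    b, c, h, w = src_feat.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=torch.float32),
        torch.arange(w, dtype=torch.float32),
        indexing="ij")
    samples = []
    for off in offsets:
        shift = (coarse_disp.squeeze(1) + off) * baseline    # (B, H, W) horizontal shift
        grid_x = (xs + shift) / (w - 1) * 2 - 1              # normalise to [-1, 1]
        grid_y = (ys / (h - 1) * 2 - 1).expand_as(grid_x)
        grid = torch.stack([grid_x, grid_y], dim=-1)         # (B, H, W, 2)
        samples.append(F.grid_sample(src_feat, grid, align_corners=True))
    return torch.stack(samples, dim=2)                       # (B, C, K, H, W)

feat = torch.randn(1, 16, 32, 32)
disp = torch.zeros(1, 1, 32, 32)
print(preselected_epipolar_samples(feat, disp, baseline=1.0).shape)  # (1, 16, 3, 32, 32)
```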
GPT4Ego: Unleashing the Potential of Pre-Trained Models for Zero-Shot Egocentric Action Recognition
IF 8.4 | Q1, Computer Science
IEEE Transactions on Multimedia | Pub Date: 2024-12-27 | DOI: 10.1109/TMM.2024.3521658 | Vol. 27, pp. 401-413
Authors: Guangzhao Dai; Xiangbo Shu; Wenhao Wu; Rui Yan; Jiachao Zhang
Abstract: Vision-Language Models (VLMs), pre-trained on large-scale datasets, have shown impressive performance in various visual recognition tasks. This advancement paves the way for notable performance in egocentric tasks such as Zero-Shot Egocentric Action Recognition (ZS-EAR), which requires VLMs to recognize, zero-shot, actions in first-person videos rich in realistic human-environment interactions. Typically, VLMs handle ZS-EAR as a global video-text matching task, which often leads to suboptimal alignment of visual and linguistic knowledge. We propose a refined approach for ZS-EAR using VLMs, emphasizing fine-grained concept-description alignment that capitalizes on the rich semantic and contextual details in egocentric videos. In this work, we introduce a straightforward yet remarkably potent VLM framework, GPT4Ego, designed to enhance the fine-grained alignment of concepts and descriptions between vision and language. Specifically, we first propose a new Ego-oriented Text Prompting (EgoTP♠) scheme, which effectively prompts action-related textual-contextual semantics by evolving word-level class names into sentence-level contextual descriptions via ChatGPT with well-designed chain-of-thought textual prompts. Moreover, we design a new Ego-oriented Visual Parsing (EgoVP♣) strategy that learns action-related visual-contextual semantics by refining global-level images into part-level contextual concepts with the help of SAM. Extensive experiments demonstrate that GPT4Ego significantly outperforms existing VLMs on three large-scale egocentric video benchmarks: EPIC-KITCHENS-100 (33.2%, +9.4), EGTEA (39.6%, +5.5), and CharadesEgo (31.5%, +2.6). In addition, benefiting from the novel mechanism of fine-grained concept and description alignment, GPT4Ego can evolve sustainably with the advancement of ever-growing pre-trained foundation models. We hope this work encourages the egocentric community to investigate pre-trained vision-language models further.
An illustrative code sketch of the class-name-to-description prompting idea follows this entry.
Citations: 0
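EgoTP is described as evolving word-level class names into sentence-level contextual descriptions through chain-of-thought prompts to ChatGPT, and the video is then matched against the expanded texts. The sketch below illustrates that flow generically: the prompt wording is invented for illustration, and `chat` and `encode_text` are placeholder callables for whatever LLM and CLIP-like text encoder one plugs in; none of this is the released GPT4Ego code.

```python
# A hypothetical sketch of prompt expansion for zero-shot egocentric action recognition:
# expand each class name into several sentence-level descriptions with an LLM, encode
# them, and average the video-text similarities as the zero-shot class score.
from typing import Callable, List
import torch

def build_egocentric_prompt(class_name: str, num_descriptions: int = 5) -> str:
    """Compose a chain-of-thought style prompt asking an LLM to enrich one action class."""
    return (
        f"The action class is '{class_name}', seen from a first-person (egocentric) camera.\n"
        "Step 1: List the objects and hand movements typically involved.\n"
        "Step 2: Describe the surroundings where this usually happens.\n"
        f"Step 3: Write {num_descriptions} one-sentence descriptions of a person "
        "performing this action, each mentioning hands, objects, and context."
    )

def zero_shot_scores(video_feat: torch.Tensor, class_names: List[str],
                     chat: Callable[[str], List[str]],
                     encode_text: Callable[[List[str]], torch.Tensor]) -> torch.Tensor:
    """Average video-text similarity over the LLM-expanded descriptions of each class."""
    scores = []
    for name in class_names:
        descriptions = chat(build_egocentric_prompt(name))   # sentence-level texts
        text_feat = encode_text(descriptions)                 # (N, D), assumed L2-normalised
        scores.append((video_feat @ text_feat.t()).mean())    # mean over the N descriptions
    return torch.stack(scores)                                # one zero-shot score per class
```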
VGNet: Multimodal Feature Extraction and Fusion Network for 3D CAD Model Retrieval
IF 8.4 | Q1, Computer Science
IEEE Transactions on Multimedia | Pub Date: 2024-12-27 | DOI: 10.1109/TMM.2024.3521706 | Vol. 27, pp. 1432-1447
Authors: Feiwei Qin; Gaoyang Zhan; Meie Fang; C. L. Philip Chen; Ping Li
Abstract: The reuse of 3D CAD models is crucial for industrial manufacturing because it shortens development cycles and reduces costs. Significant progress has been made in deep learning-based 3D model retrieval. Among the many representations for 3D models, the multi-view representation has demonstrated superior retrieval performance. However, directly applying these 3D model retrieval approaches to 3D CAD model retrieval may cause issues such as the loss of engineering semantics and structural information. In this paper, we find that multiple views and the B-rep complement each other. We therefore propose the view graph neural network (VGNet), which effectively combines multiple views and the B-rep to accomplish 3D CAD model retrieval. More specifically, exploiting the regular shapes of 3D CAD models and the rich attribute information in the B-rep attribute graph, we design a separate feature extraction network for each modality. Moreover, to explore the latent relationships between the multiple views and the B-rep attribute graph, a multi-head attention enhancement module is designed. Furthermore, a multimodal fusion module makes the joint representation of 3D CAD models more discriminative by using a correlation loss function. Experiments on a real manufacturing 3D CAD dataset and a public dataset validate the effectiveness of the proposed approach.
An illustrative code sketch of a cross-modal correlation loss follows this entry.
Citations: 0
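The abstract states that the fusion module uses a correlation loss to make the joint multi-view/B-rep representation more discriminative, without giving its form. The sketch below uses a symmetric contrastive objective as a stand-in correlation loss between the two modality embeddings; the exact loss, the temperature value, and the concatenation-based fusion are assumptions, not VGNet's design.

```python
# A hypothetical sketch of fusing two modality embeddings of a CAD model -- one from
# multi-view images, one from a B-rep attribute graph -- with a correlation-style loss
# that pulls paired embeddings together and pushes apart other models in the batch.
import torch
import torch.nn.functional as F

def correlation_loss(view_emb, graph_emb, temperature=0.07):
    """view_emb, graph_emb: (B, D) embeddings of the same B CAD models, row-aligned."""
    v = F.normalize(view_emb, dim=1)
    g = F.normalize(graph_emb, dim=1)
    logits = v @ g.t() / temperature                  # (B, B) cross-modal similarities
    targets = torch.arange(v.size(0), device=v.device)
    # Symmetric cross-entropy: each view embedding should match its own graph embedding
    # (the diagonal) more than any other model in the batch, and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def fuse(view_emb, graph_emb):
    """Joint representation used for retrieval: concatenate the two normalised modalities."""
    return torch.cat([F.normalize(view_emb, dim=1), F.normalize(graph_emb, dim=1)], dim=1)

v = torch.randn(8, 256)
g = torch.randn(8, 256)
print(correlation_loss(v, g).item(), fuse(v, g).shape)
```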
Multiview Feature Decoupling for Deep Subspace Clustering
IF 8.4 | Q1, Computer Science
IEEE Transactions on Multimedia | Pub Date: 2024-12-27 | DOI: 10.1109/TMM.2024.3521776 | Vol. 27, pp. 544-556
Authors: Yuxiu Lin; Hui Liu; Ren Wang; Qiang Guo; Caiming Zhang
Abstract: Deep multi-view subspace clustering aims to reveal a common subspace structure by exploiting rich multi-view information. Despite promising progress, current methods focus only on multi-view consistency and complementarity, often overlooking the adverse influence of entangled superfluous information in features. Moreover, most existing works lack scalability and are inefficient in large-scale scenarios. To this end, we propose a deep subspace clustering method via Multi-view Feature Decoupling (MvFD). First, MvFD incorporates well-designed multi-type auto-encoders with self-supervised learning, explicitly decoupling consistent, complementary, and superfluous features for every view. The disentangled and interpretable feature space can then better serve unified representation learning. By integrating these three types of information within a unified framework, we employ information theory to obtain a minimal and sufficient representation with high discriminability. In addition, we introduce a deep metric network to model self-expression correlations more efficiently, whose parameters are unaffected by changes in the number of samples. Extensive experiments show that MvFD achieves state-of-the-art performance on various types of multi-view datasets.
An illustrative code sketch of per-view feature decoupling follows this entry.
Citations: 0
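To make the consistent/complementary/superfluous decoupling concrete, the sketch below pairs a per-view auto-encoder that splits its code into three parts with a reconstruction term (so nothing is discarded) and a simple cross-view alignment term on the consistent part. The layer sizes, MSE-based losses, and the equal three-way split are illustrative assumptions rather than MvFD's actual self-supervised design.

```python
# A hypothetical sketch of decoupling each view's representation into consistent /
# complementary / superfluous parts with a per-view auto-encoder: the decoder must
# reconstruct the view from all three parts, while an alignment loss ties the
# consistent parts of different views together.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ViewAutoEncoder(nn.Module):
    def __init__(self, in_dim, part_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                     nn.Linear(256, 3 * part_dim))
        self.decoder = nn.Sequential(nn.Linear(3 * part_dim, 256), nn.ReLU(),
                                     nn.Linear(256, in_dim))
        self.part_dim = part_dim

    def forward(self, x):
        z = self.encoder(x)
        consistent, complementary, superfluous = z.split(self.part_dim, dim=1)
        recon = self.decoder(z)
        return consistent, complementary, superfluous, recon

def decoupling_loss(views, models):
    """views: list of (B, D_v) tensors; models: one ViewAutoEncoder per view."""
    consistents, loss = [], 0.0
    for x, m in zip(views, models):
        c, _, _, recon = m(x)
        consistents.append(F.normalize(c, dim=1))
        loss = loss + F.mse_loss(recon, x)        # sufficiency via reconstruction
    for c in consistents[1:]:                     # consistency across views
        loss = loss + F.mse_loss(c, consistents[0])
    return loss

views = [torch.randn(16, 100), torch.randn(16, 80)]
models = [ViewAutoEncoder(100), ViewAutoEncoder(80)]
print(decoupling_loss(views, models).item())
```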
DuPMAM: An Efficient Dual Perception Framework Equipped With a Sharp Testing Strategy for Point Cloud Analysis
IF 8.4 | Q1, Computer Science
IEEE Transactions on Multimedia | Pub Date: 2024-12-27 | DOI: 10.1109/TMM.2024.3521735 | Vol. 27, pp. 1760-1771
Authors: Yijun Chen; Xianwei Zheng; Zhulun Yang; Xutao Li; Jiantao Zhou; Yuanman Li
Abstract: The challenges in point cloud analysis are primarily attributed to the irregular and unordered nature of the data. Numerous existing approaches, inspired by the Transformer, introduce attention mechanisms to extract 3D geometric features. However, these intricate geometric extractors incur high computational overhead and unfavorable inference latency. To tackle this predicament, we propose a lightweight and faster attention-based network, named Dual Perception MAM (DuPMAM), for point cloud analysis. Specifically, we present a novel, simple Point Multiplicative Attention Mechanism (PMAM). It is implemented solely with single feed-forward fully connected layers, leading to lower model complexity and superior inference speed. Based on that, we further devise a dual perception strategy by constructing both a local attention block and a global attention block to learn fine-grained geometric and overall representational features, respectively. Consequently, compared to existing approaches, our method has excellent perception of both the local details and the global contours of point cloud objects. In addition, we design a Graph-Multiscale Perceptual Field (GMPF) testing strategy for model performance enhancement. It has a significant advantage over the traditional voting strategy and is generally applicable to point cloud tasks, encompassing classification, part segmentation, and indoor scene segmentation. Empowered by the GMPF testing strategy, DuPMAM sets a new state of the art on the real-world dataset ScanObjectNN, the synthetic dataset ModelNet40, and the part segmentation dataset ShapeNet, and compared to the recent GB-Net, DuPMAM trains 6 times faster and tests 2 times faster.
An illustrative code sketch of a feed-forward multiplicative point attention follows this entry.
Citations: 0
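The abstract notes that PMAM is built solely from single feed-forward fully connected layers, avoiding the quadratic query-key matrix of standard Transformer attention. The sketch below shows one plausible reading of such a multiplicative attention; the specific gating, the softmax axis, and the global-context fusion are assumptions, not the published PMAM.

```python
# A hypothetical sketch of a "multiplicative" point attention built only from
# fully connected layers: per-point attention scores come from a single Linear layer
# and modulate the value features element-wise, so no N x N similarity matrix is formed.
import torch
import torch.nn as nn

class PointMultiplicativeAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.value = nn.Linear(dim, dim)
        self.score = nn.Linear(dim, dim)    # per-point, per-channel gate logits

    def forward(self, x):                   # x: (B, N, C) point features
        v = self.value(x)
        attn = torch.softmax(self.score(x), dim=1)   # normalise over the N points
        global_feat = (attn * v).sum(dim=1)          # (B, C) attended global descriptor
        # Broadcast the attended context back and fuse it multiplicatively per point.
        return v * global_feat.unsqueeze(1)

pts = torch.randn(4, 1024, 128)
print(PointMultiplicativeAttention(128)(pts).shape)   # torch.Size([4, 1024, 128])
```

The cost is linear in the number of points, which is consistent with the efficiency claim, but the exact block composition in DuPMAM may differ.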
StyleAM: Perception-Oriented Unsupervised Domain Adaption for No-Reference Image Quality Assessment
IF 8.4 | Q1, Computer Science
IEEE Transactions on Multimedia | Pub Date: 2024-12-25 | DOI: 10.1109/TMM.2024.3521705 | Vol. 27, pp. 2043-2058
Authors: Yiting Lu; Xin Li; Jianzhao Liu; Zhibo Chen
Abstract: Deep neural networks (DNNs) have shown great potential in no-reference image quality assessment (NR-IQA). However, annotation for NR-IQA is labor-intensive and time-consuming, which severely limits its application, especially for authentic images. To relieve the dependence on quality annotation, some works have applied unsupervised domain adaptation (UDA) to NR-IQA. However, these methods ignore the fact that the alignment space used in classification is sub-optimal, since that space is not elaborately designed for perception. To address this challenge, we propose an effective perception-oriented unsupervised domain adaptation method, StyleAM (Style Alignment and Mixup), for NR-IQA, which transfers sufficient knowledge from label-rich source-domain data to label-free target-domain images. Specifically, we find a more compact and reliable space, i.e., the feature style space, for perception-oriented UDA, based on the interesting observation that the feature style (i.e., the mean and variance) of deep layers in DNNs is closely associated with the quality score in NR-IQA. We therefore propose to align the source and target domains in this more perception-oriented space, the feature style space, to reduce interference from quality-irrelevant feature factors. Furthermore, to increase the consistency (i.e., the ordinal/continuous characteristics) between the quality score and its feature style, we also propose a novel feature augmentation strategy, Style Mixup, which mixes the feature styles (i.e., the mean and variance) before the last layer of the DNN together with mixing their labels. Extensive experimental results on many cross-domain settings (e.g., synthetic to authentic, and multiple distortions to one distortion) demonstrate the effectiveness of the proposed StyleAM for NR-IQA.
An illustrative code sketch of feature-style extraction and Style Mixup follows this entry.
Citations: 0
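The abstract pins the "feature style" down to the channel-wise mean and variance of a deep layer, and Style Mixup to mixing those styles together with the labels. The sketch below implements exactly those two operations in isolation; the Beta-distributed mixing coefficient and the concatenation of mean and standard deviation are common mixup conventions assumed here, not details given in the abstract.

```python
# A hypothetical sketch of the two ingredients named in the StyleAM abstract: the
# "feature style" of a deep layer (channel-wise mean and standard deviation) and
# Style Mixup, which mixes the styles of two samples together with their labels.
import torch

def feature_style(feat, eps=1e-6):
    """feat: (B, C, H, W) -> per-sample channel-wise mean and std, each (B, C)."""
    mean = feat.mean(dim=(2, 3))
    std = feat.var(dim=(2, 3), unbiased=False).add(eps).sqrt()
    return mean, std

def style_mixup(feat_a, feat_b, label_a, label_b, alpha=0.4):
    """Mix the styles (and labels) of two batches; returns the mixed style and label."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    mean_a, std_a = feature_style(feat_a)
    mean_b, std_b = feature_style(feat_b)
    mixed_style = torch.cat([lam * mean_a + (1 - lam) * mean_b,
                             lam * std_a + (1 - lam) * std_b], dim=1)   # (B, 2C)
    mixed_label = lam * label_a + (1 - lam) * label_b
    return mixed_style, mixed_label

fa, fb = torch.randn(8, 64, 14, 14), torch.randn(8, 64, 14, 14)
ya, yb = torch.rand(8), torch.rand(8)
style, label = style_mixup(fa, fb, ya, yb)
print(style.shape, label.shape)   # torch.Size([8, 128]) torch.Size([8])
```

Mixing labels in lockstep with styles is what preserves the ordinal relation between style and quality score that the abstract emphasizes.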
MENSA: Multi-Dataset Harmonized Pretraining for Semantic Segmentation
IF 8.4 | Q1, Computer Science
IEEE Transactions on Multimedia | Pub Date: 2024-12-25 | DOI: 10.1109/TMM.2024.3521851 | Vol. 27, pp. 2127-2140
Authors: Bowen Shi; Xiaopeng Zhang; Yaoming Wang; Wenrui Dai; Junni Zou; Hongkai Xiong
Abstract: Existing pretraining methods for semantic segmentation are hampered by the task gap between global image-level pretraining and local pixel-level finetuning. Joint dense-level pretraining is a promising alternative that exploits off-the-shelf annotations from diverse segmentation datasets, but it suffers from low-quality class embeddings and inconsistent data and supervision signals across datasets when CLIP is employed directly. To overcome these challenges, we propose a novel Multi-datasEt harmoNized pretraining framework for Semantic sEgmentation (MENSA). MENSA incorporates high-quality language embeddings and momentum-updated visual embeddings to effectively model class relationships in the embedding space and thereby provide reliable supervision for each category. To further adapt to multiple datasets, we achieve one-to-many pixel-embedding pairing with cross-dataset multi-label mapping through cross-modal information exchange to mitigate inconsistent supervision signals, and we introduce region-level and pixel-level cross-dataset mixing to handle varying data distributions. Experimental results demonstrate that MENSA is a powerful foundation segmentation model that consistently outperforms popular supervised and unsupervised ImageNet-pretrained models across various benchmarks under standard fine-tuning. Furthermore, MENSA significantly benefits frozen-backbone fine-tuning and zero-shot learning by endowing the learned representations with pixel-level distinctiveness.
An illustrative code sketch of momentum-updated class embeddings follows this entry.
Citations: 0
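Among the components the abstract names, the momentum-updated visual embeddings are the most self-contained to illustrate. The sketch below keeps a per-class embedding bank updated as an exponential moving average of the pixel features assigned to each class, giving stable per-category targets; the bank structure, momentum value, and per-class mean pooling are assumptions, not MENSA's implementation.

```python
# A hypothetical sketch of momentum-updated class embeddings: a bank of per-class
# vectors is updated as an EMA of the pixel features assigned to each class.
import torch
import torch.nn.functional as F

class MomentumClassBank:
    def __init__(self, num_classes, dim, momentum=0.999):
        self.bank = F.normalize(torch.randn(num_classes, dim), dim=1)
        self.momentum = momentum

    @torch.no_grad()
    def update(self, pixel_feats, pixel_labels):
        """pixel_feats: (N, D) features; pixel_labels: (N,) class ids in [0, num_classes)."""
        for c in pixel_labels.unique():
            mean_feat = pixel_feats[pixel_labels == c].mean(dim=0)   # per-class mean feature
            new = self.momentum * self.bank[c] + (1 - self.momentum) * mean_feat
            self.bank[c] = F.normalize(new, dim=0)                    # keep unit norm

bank = MomentumClassBank(num_classes=19, dim=256)
feats = torch.randn(4096, 256)
labels = torch.randint(0, 19, (4096,))
bank.update(feats, labels)
print(bank.bank.shape)   # torch.Size([19, 256])
```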