IEEE Transactions on Multimedia: Latest Articles

SOFW: A Synergistic Optimization Framework for Indoor 3D Object Detection
IF 8.4 | CAS Q1 | Computer Science
IEEE Transactions on Multimedia | Pub Date: 2025-01-01 | DOI: 10.1109/TMM.2024.3521782
Kun Dai; Zhiqiang Jiang; Tao Xie; Ke Wang; Dedong Liu; Zhendong Fan; Ruifeng Li; Lijun Zhao; Mohamed Omar
Abstract: In this work, we observe that indoor 3D object detection across varied scene domains encompasses both universal attributes and domain-specific features. Based on this insight, we propose SOFW, a synergistic optimization framework that investigates the feasibility of optimizing 3D object detection tasks concurrently across several dataset domains. The core of SOFW is to identify domain-shared parameters that encode universal scene attributes, while employing domain-specific parameters to capture the particularities of each scene domain. Technically, we introduce a set abstraction alteration strategy (SAAS) that embeds learnable domain-specific features into the set abstraction layers, equipping the network with a refined comprehension of each scene domain. In addition, we develop an element-wise sharing strategy (ESS) that enables fine-grained, adaptive discernment between domain-shared and domain-specific parameters for each network layer. Benefiting from these techniques, SOFW crafts feature representations for each scene domain by learning domain-specific parameters, while encoding generic attributes and contextual interdependencies via domain-shared parameters. Built upon the classical detection framework VoteNet without any complicated extra modules, SOFW delivers impressive performance on multiple benchmarks with a much smaller total storage footprint. Additionally, we show that ESS is a universal strategy: applying it to the voxel-based approach TR3D achieves cutting-edge detection accuracy on the S3DIS, ScanNet, and SUN RGB-D datasets.
Volume 27, pages 637-651.
Citations: 0
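The element-wise sharing idea in the abstract above, deciding per weight element whether a parameter should be shared across domains or specialized per domain, can be pictured with a small PyTorch sketch. This is only an illustration of the general mechanism under assumed names (ElementwiseSharedLinear, num_domains); it is not the authors' SOFW/ESS implementation.

```python
import torch
import torch.nn as nn

class ElementwiseSharedLinear(nn.Module):
    """Linear layer whose weights are an element-wise blend of one
    domain-shared matrix and one matrix per scene domain (hypothetical
    illustration of an element-wise sharing strategy)."""

    def __init__(self, in_dim, out_dim, num_domains):
        super().__init__()
        self.shared = nn.Parameter(torch.randn(out_dim, in_dim) * 0.02)
        self.specific = nn.Parameter(torch.randn(num_domains, out_dim, in_dim) * 0.02)
        # One learnable gate logit per weight element and per domain;
        # sigmoid(gate) close to 1 means "use the shared parameter here".
        self.gate = nn.Parameter(torch.zeros(num_domains, out_dim, in_dim))
        self.bias = nn.Parameter(torch.zeros(out_dim))

    def forward(self, x, domain_id):
        g = torch.sigmoid(self.gate[domain_id])
        weight = g * self.shared + (1.0 - g) * self.specific[domain_id]
        return nn.functional.linear(x, weight, self.bias)

# Usage: one layer serving three dataset domains (e.g., ScanNet, SUN RGB-D, S3DIS).
layer = ElementwiseSharedLinear(256, 128, num_domains=3)
features = torch.randn(32, 256)
out = layer(features, domain_id=1)   # (32, 128)
```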
Towards Neural Codec-Empowered 360° Video Streaming: A Saliency-Aided Synergistic Approach
IF 8.4 | CAS Q1 | Computer Science
IEEE Transactions on Multimedia | Pub Date: 2024-12-30 | DOI: 10.1109/TMM.2024.3521770
Jianxin Shi; Miao Zhang; Linfeng Shen; Jiangchuan Liu; Lingjun Pu; Jingdong Xu
Abstract: Networked 360° video has become increasingly popular. Despite the immersive experience it offers users, its sheer data volume, even with the latest H.266 coding and viewport adaptation, remains a significant challenge for today's networks. Recent studies have shown that integrating deep learning into video coding can significantly enhance compression efficiency, providing new opportunities for high-quality video streaming. In this work, we conduct a comprehensive analysis of the potential and issues in applying neural codecs to 360° video streaming. We accordingly present NETA, a synergistic streaming scheme that merges neural compression with traditional coding techniques, seamlessly implemented within an edge intelligence framework. To address the non-trivial challenges of a short viewport-prediction window and time-varying viewing directions, we propose implicit-explicit buffer-based prefetching grounded in content visual saliency, together with bitrate adaptation that smartly switches models around viewports. A novel Lyapunov-guided deep reinforcement learning algorithm is developed to maximize user experience and ensure long-term system stability. We further discuss concerns regarding practical development and deployment, and we have built a working prototype that verifies NETA's excellent performance. For instance, it achieves a 27% increase in viewing quality, a 90% reduction in rebuffering time, and a 64% decrease in quality variation on average, compared to state-of-the-art approaches.
Volume 27, pages 1588-1600.
Citations: 0
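The "Lyapunov-guided" bitrate adaptation mentioned above builds on the standard drift-plus-penalty pattern: trade instantaneous quality against the growth of a buffer/backlog queue. The sketch below shows only that generic pattern, with hypothetical quality and download-time models; it is not the NETA algorithm or its reinforcement-learning policy.

```python
import math

def choose_bitrate(bitrates_mbps, throughput_mbps, segment_s, queue, V=5.0):
    """Generic Lyapunov drift-plus-penalty selection (illustrative only).

    queue tracks accumulated rebuffering pressure: it grows when a segment
    takes longer to download than to play, and drains otherwise.
    V weighs video quality against queue (stability) concerns.
    """
    best, best_score = None, math.inf
    for r in bitrates_mbps:
        download_s = r * segment_s / max(throughput_mbps, 1e-6)
        drift = queue * (download_s - segment_s)      # queue growth if this bitrate is chosen
        penalty = -V * math.log(1.0 + r)              # higher bitrate -> lower penalty
        score = drift + penalty
        if score < best_score:
            best, best_score = r, score
    return best

# Usage: simulate a few one-second segments under varying throughput.
queue = 0.0
for thr in [8.0, 3.0, 1.5, 6.0]:
    r = choose_bitrate([1.0, 2.5, 5.0, 8.0], thr, segment_s=1.0, queue=queue)
    download = r * 1.0 / thr
    queue = max(queue + download - 1.0, 0.0)
    print(f"throughput={thr} Mbps -> pick {r} Mbps, queue={queue:.2f}")
```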
Learning Local Features by Reinforcing Spatial Structure Information
IF 8.4 | CAS Q1 | Computer Science
IEEE Transactions on Multimedia | Pub Date: 2024-12-30 | DOI: 10.1109/TMM.2024.3521777
Li Wang; Yunzhou Zhang; Fawei Ge; Wenjing Bai; Yifan Wang
Abstract: Learning-based local feature extraction algorithms have advanced considerably in terms of robustness. While excelling at enhancing feature robustness, some otherwise outstanding algorithms tend to neglect discriminability, a crucial aspect of vision tasks. As the number of convolutional layers increases, we observe an amplification of semantic information within images, accompanied by a diminishing presence of spatial structural information. This imbalance is the primary cause of sub-par feature discriminability. This paper therefore introduces a novel network framework aimed at imbuing feature descriptors with both robustness and discriminative power by reinforcing spatial structural information. Our approach incorporates a spatial structure enhancement module into the network architecture, spanning from shallow to deep layers, ensuring that rich structural information is retained in the deeper layers and thereby enhancing discriminability. Finally, we evaluate our method, demonstrating superior performance in visual localization and feature-matching tasks.
Volume 27, pages 1420-1431.
Citations: 0
Disaggregation Distillation for Person Search
IF 8.4 | CAS Q1 | Computer Science
IEEE Transactions on Multimedia | Pub Date: 2024-12-30 | DOI: 10.1109/TMM.2024.3521732
Yizhen Jia; Rong Quan; Haiyan Chen; Jiamei Liu; Yichao Yan; Song Bai; Jie Qin
Abstract: Person search is a challenging task in computer vision and multimedia understanding that aims to localize and identify target individuals in realistic scenes. State-of-the-art models achieve remarkable accuracy but suffer from heavy computation and inefficient inference, making them impractical in most real-world applications. A promising approach to this dilemma is to compress person search models with knowledge distillation (KD). Previous KD-based person search methods typically distill knowledge only from the re-identification (re-id) branch, completely overlooking the useful knowledge in the detection branch. In addition, we show that the imbalance between person and background regions in feature maps negatively impacts the distillation process. To this end, we propose a novel KD-based approach, Disaggregation Distillation for Person Search (DDPS), which disaggregates the distillation process and the feature maps, respectively. First, the distillation process is disaggregated into two task-oriented sub-processes, i.e., detection distillation and re-id distillation, to help the student learn both accurate localization and discriminative person embeddings. Second, we disaggregate each feature map into person and background regions and distill the two regions independently to alleviate the imbalance problem. Concretely, three types of distillation modules, i.e., logit distillation (LD), correlation distillation (CD), and disaggregation feature distillation (DFD), are designed to transfer comprehensive information from the teacher to the student. This simple yet effective distillation scheme can be readily applied to both homogeneous and heterogeneous teacher-student combinations. We conduct extensive experiments on two person search benchmarks, where the results demonstrate that, surprisingly, DDPS enables the student model to surpass the performance of the corresponding teacher model, even achieving results comparable to general person search models.
Volume 27, pages 158-170.
Citations: 0
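The "disaggregation" of feature maps into person and background regions before distillation can be written down compactly: build a binary mask from the person boxes and compute the teacher-student feature loss for the two regions separately. The sketch below is a hypothetical illustration of that idea (the function name and weighting are assumptions), not the published DDPS modules.

```python
import torch

def disaggregated_feature_distill(f_student, f_teacher, boxes, alpha=1.0, beta=1.0):
    """Distill person and background regions of a feature map independently.

    f_student, f_teacher: (B, C, H, W) feature maps (same shape for simplicity).
    boxes: list of (N_i, 4) tensors with person boxes in feature-map coordinates.
    Returns a weighted sum of the two region losses (illustrative only).
    """
    B, C, H, W = f_student.shape
    mask = torch.zeros(B, 1, H, W, device=f_student.device)
    for b, bxs in enumerate(boxes):
        for x1, y1, x2, y2 in bxs.long().tolist():
            mask[b, :, max(y1, 0):min(y2, H), max(x1, 0):min(x2, W)] = 1.0

    diff = (f_student - f_teacher) ** 2
    person_loss = (diff * mask).sum() / mask.sum().clamp(min=1.0) / C
    bg_mask = 1.0 - mask
    bg_loss = (diff * bg_mask).sum() / bg_mask.sum().clamp(min=1.0) / C
    return alpha * person_loss + beta * bg_loss

# Usage with random tensors and one person box per image.
fs, ft = torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32)
boxes = [torch.tensor([[4, 4, 16, 28]]), torch.tensor([[10, 2, 20, 30]])]
loss = disaggregated_feature_distill(fs, ft, boxes)
```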
IEEE Transactions on Multimedia Publication Information
IF 8.4 | CAS Q1 | Computer Science
IEEE Transactions on Multimedia | Pub Date: 2024-12-27 | DOI: 10.1109/TMM.2024.3444988
No abstract. Volume 26, pages C2-C2. Open Access PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10817140
Citations: 0
Structure-Aware Pre-Selected Neural Rendering for Light Field Reconstruction
IF 8.4 | CAS Q1 | Computer Science
IEEE Transactions on Multimedia | Pub Date: 2024-12-27 | DOI: 10.1109/TMM.2024.3521784
Song Chang; Youfang Lin; Shuo Zhang
Abstract: Because densely sampled Light Field (LF) images benefit many applications, LF reconstruction has become an important technology in related fields. Recently, neural rendering has shown great potential in reconstruction tasks. However, volume rendering in existing methods needs to sample many points along the whole camera ray or epipolar line, which is time-consuming. In this paper, specifically for LF images with regular angular sampling, we propose a novel Structure-Aware Pre-Selected neural rendering framework for LF reconstruction. Instead of sampling along the whole epipolar line, we sample at a few specific positions, which are estimated using the color and inherent scene-structure information present in regularly angular-sampled LF images. By sampling only a few points that closely match the target pixel, the feature of the target pixel is rendered quickly and with high quality. Finally, we fuse the features and decode them along the view dimension to obtain the final target view. Experiments show that the proposed method outperforms state-of-the-art LF reconstruction methods in both qualitative and quantitative comparisons across various tasks. Our method also surpasses most existing methods in terms of speed. Moreover, without any retraining or fine-tuning, our method without per-scene optimization even outperforms methods that use per-scene optimization.
Volume 27, pages 1574-1587.
Citations: 0
GPT4Ego: Unleashing the Potential of Pre-Trained Models for Zero-Shot Egocentric Action Recognition
IF 8.4 | CAS Q1 | Computer Science
IEEE Transactions on Multimedia | Pub Date: 2024-12-27 | DOI: 10.1109/TMM.2024.3521658
Guangzhao Dai; Xiangbo Shu; Wenhao Wu; Rui Yan; Jiachao Zhang
Abstract: Vision-Language Models (VLMs), pre-trained on large-scale datasets, have shown impressive performance in various visual recognition tasks. This advancement paves the way for notable performance in some egocentric tasks, such as Zero-Shot Egocentric Action Recognition (ZS-EAR), which entails VLMs recognizing actions zero-shot from first-person videos rich in realistic human-environment interactions. Typically, VLMs handle ZS-EAR as a global video-text matching task, which often leads to suboptimal alignment of visual and linguistic knowledge. We propose a refined approach to ZS-EAR using VLMs, emphasizing fine-grained concept-description alignment that capitalizes on the rich semantic and contextual details in egocentric videos. In this work, we introduce a straightforward yet remarkably potent VLM framework, aka GPT4Ego, designed to enhance the fine-grained alignment of concepts and descriptions between vision and language. Specifically, we first propose a new Ego-oriented Text Prompting (EgoTP♠) scheme, which effectively prompts action-related textual-contextual semantics by evolving word-level class names into sentence-level contextual descriptions via ChatGPT with well-designed chain-of-thought textual prompts. Moreover, we design a new Ego-oriented Visual Parsing (EgoVP♣) strategy that learns action-related visual-contextual semantics by refining global-level images into part-level contextual concepts with the help of SAM. Extensive experiments demonstrate that GPT4Ego significantly outperforms existing VLMs on three large-scale egocentric video benchmarks, i.e., EPIC-KITCHENS-100 (33.2%, +9.4), EGTEA (39.6%, +5.5), and CharadesEgo (31.5%, +2.6). In addition, benefiting from the novel mechanism of fine-grained concept and description alignment, GPT4Ego can evolve sustainably with the advancement of ever-growing pre-trained foundation models. We hope this work encourages the egocentric community to investigate pre-trained vision-language models further.
Volume 27, pages 401-413.
Citations: 0
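The Ego-oriented Text Prompting step above, growing a word-level class name (e.g., "cut tomato") into a sentence-level contextual description via a chain-of-thought prompt, amounts to prompt construction plus an LLM call. The template wording and the expand_class_name helper below are hypothetical, the actual ChatGPT call is left abstract, and this is not the GPT4Ego prompt set.

```python
def build_egocentric_cot_prompt(class_name: str) -> str:
    """Compose a chain-of-thought style prompt that asks an LLM to turn a
    word-level action class into a sentence-level, first-person description
    (hypothetical template for illustration)."""
    return (
        "You describe actions seen from a head-mounted (egocentric) camera.\n"
        f"Action class: '{class_name}'.\n"
        "Step 1: List the hands, objects, and surfaces typically involved.\n"
        "Step 2: Describe the motion of the camera wearer's hands.\n"
        "Step 3: Write one fluent sentence, in the first person, that a video "
        "of this action would match."
    )

def expand_class_name(class_name: str, llm) -> str:
    """llm is any callable mapping a prompt string to a completion string;
    wiring up a specific chat API is intentionally left out of this sketch."""
    return llm(build_egocentric_cot_prompt(class_name))

# Usage with a stand-in "LLM" so the sketch stays self-contained.
fake_llm = lambda prompt: "I hold a tomato on the cutting board and slice it with a knife."
print(expand_class_name("cut tomato", fake_llm))
```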
VGNet: Multimodal Feature Extraction and Fusion Network for 3D CAD Model Retrieval
IF 8.4 | CAS Q1 | Computer Science
IEEE Transactions on Multimedia | Pub Date: 2024-12-27 | DOI: 10.1109/TMM.2024.3521706
Feiwei Qin; Gaoyang Zhan; Meie Fang; C. L. Philip Chen; Ping Li
Abstract: The reuse of 3D CAD models is crucial for industrial manufacturing because it shortens development cycles and reduces costs. Significant progress has been made in deep learning-based 3D model retrieval. There are many representations for 3D models, among which the multi-view representation has demonstrated superior retrieval performance. However, directly applying these 3D model retrieval approaches to 3D CAD model retrieval may cause issues such as the loss of engineering semantics and structural information. In this paper, we find that multiple views and the B-rep can complement each other. We therefore propose the view graph neural network (VGNet), which effectively combines multiple views and the B-rep to accomplish 3D CAD model retrieval. More specifically, based on the regular shapes of 3D CAD models and the richness of the attribute information in the B-rep attribute graph, we design a separate feature extraction network for each modality. Moreover, to explore the latent relationships between the multiple views and the B-rep attribute graph, a multi-head attention enhancement module is designed. Furthermore, a multimodal fusion module is adopted to make the joint representation of 3D CAD models more discriminative by using a correlation loss function. Experiments are carried out on a real manufacturing 3D CAD dataset and a public dataset to validate the effectiveness of the proposed approach.
Volume 27, pages 1432-1447.
Citations: 0
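The multimodal fusion described above, combining a multi-view image embedding with a B-rep graph embedding while encouraging the two to correlate, is easy to picture as a gated fusion plus a correlation term. The sketch below uses hypothetical module and loss names and is only a generic illustration, not VGNet itself.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedFusion(nn.Module):
    """Fuse a multi-view embedding and a B-rep graph embedding with a learned gate."""

    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, view_emb, graph_emb):
        g = torch.sigmoid(self.gate(torch.cat([view_emb, graph_emb], dim=-1)))
        return g * view_emb + (1.0 - g) * graph_emb

def correlation_loss(view_emb, graph_emb):
    """Encourage the two modalities of the same CAD model to agree (cosine)."""
    return 1.0 - F.cosine_similarity(view_emb, graph_emb, dim=-1).mean()

# Usage: batch of 8 CAD models, 256-d embeddings from each modality branch.
view_emb, graph_emb = torch.randn(8, 256), torch.randn(8, 256)
fused = GatedFusion(256)(view_emb, graph_emb)        # (8, 256) joint representation
loss = correlation_loss(view_emb, graph_emb)
```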
Multiview Feature Decoupling for Deep Subspace Clustering
IF 8.4 | CAS Q1 | Computer Science
IEEE Transactions on Multimedia | Pub Date: 2024-12-27 | DOI: 10.1109/TMM.2024.3521776
Yuxiu Lin; Hui Liu; Ren Wang; Qiang Guo; Caiming Zhang
Abstract: Deep multi-view subspace clustering aims to reveal a common subspace structure by exploiting rich multi-view information. Despite promising progress, current methods focus only on multi-view consistency and complementarity, often overlooking the adverse influence of superfluous information entangled in the features. Moreover, most existing works lack scalability and are inefficient in large-scale scenarios. To this end, we propose a deep subspace clustering method via Multi-view Feature Decoupling (MvFD). First, MvFD incorporates well-designed multi-type auto-encoders with self-supervised learning, explicitly decoupling the consistent, complementary, and superfluous features of every view. The disentangled and interpretable feature space can then better serve unified representation learning. By integrating these three types of information within a unified framework, we employ information theory to obtain a minimal and sufficient representation with high discriminability. In addition, we introduce a deep metric network to model self-expression correlation more efficiently, where the network parameters remain unaffected by changes in the number of samples. Extensive experiments show that MvFD yields state-of-the-art performance on various types of multi-view datasets.
Volume 27, pages 544-556.
Citations: 0
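The feature-decoupling idea above, giving each view an auto-encoder whose latent code is split into consistent, complementary, and superfluous parts while the view is reconstructed from all three, can be sketched as below. All names are hypothetical and the objective is reduced to reconstruction only; this is not the MvFD training objective.

```python
import torch
import torch.nn as nn

class ViewDecoupler(nn.Module):
    """Per-view auto-encoder whose latent code is split into three chunks:
    consistent (shared semantics), complementary (view-specific but useful),
    and superfluous (to be discarded by downstream clustering)."""

    def __init__(self, in_dim, part_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                     nn.Linear(256, 3 * part_dim))
        self.decoder = nn.Sequential(nn.Linear(3 * part_dim, 256), nn.ReLU(),
                                     nn.Linear(256, in_dim))
        self.part_dim = part_dim

    def forward(self, x):
        z = self.encoder(x)
        consistent, complementary, superfluous = torch.split(z, self.part_dim, dim=-1)
        recon = self.decoder(z)
        return consistent, complementary, superfluous, recon

# Usage: two views of the same samples; clustering would use only the
# consistent (and optionally complementary) parts.
views = [torch.randn(16, 300), torch.randn(16, 500)]
decouplers = [ViewDecoupler(300), ViewDecoupler(500)]
recon_losses = []
for x, net in zip(views, decouplers):
    c, comp, s, recon = net(x)
    recon_losses.append(nn.functional.mse_loss(recon, x))
recon_loss = sum(recon_losses)
```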
Dual Stream Relation Learning Network for Image-Text Retrieval
IF 8.4 | CAS Q1 | Computer Science
IEEE Transactions on Multimedia | Pub Date: 2024-12-25 | DOI: 10.1109/TMM.2024.3521736
Dongqing Wu; Huihui Li; Cang Gu; Lei Guo; Hang Liu
Abstract: Image-text retrieval has made remarkable progress through the development of feature extraction networks and model architectures. However, almost all region-feature-based methods face two serious problems when modeling modality interactions. First, region features are prone to feature entanglement in the feature extraction stage, making it difficult to accurately reason about complex intra-modal relations between visual objects. Second, region features lack rich contextual information, background, and object details, making it difficult to achieve precise inter-modal alignment with textual information. In this paper, we propose a novel Dual Stream Relation Learning Network (DSRLN) to jointly address these issues with two key components: a Geometry-sensitive Interactive Self-Attention (GISA) module and a Dual Information Fusion (DIF) module. Specifically, GISA extends the vanilla self-attention network in two respects to better model the intrinsic relationships between different regions, thereby improving high-level visual-semantic reasoning. DIF uses grid features as an additional source of visual information and achieves deeper, more complex fusion between the two types of features through a masked cross-attention module and an adaptive gate fusion module, capturing comprehensive visual information to learn more precise inter-modal alignment. Besides, our method learns a more comprehensive hierarchical correspondence between images and sentences through local and global alignment. Experimental results on two public datasets, Flickr30K and MS-COCO, fully demonstrate the superiority and effectiveness of our model.
Volume 27, pages 1551-1565.
Citations: 0
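The fusion step above, letting region features attend to grid features through masked cross-attention and then blending the result with an adaptive gate, can be illustrated with standard PyTorch attention. The names and the padding-mask convention below are assumptions made for the sketch, not the published DSRLN module.

```python
import torch
import torch.nn as nn

class RegionGridFusion(nn.Module):
    """Region features query grid features via cross-attention; a gate then
    decides how much attended grid context to mix into each region."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, regions, grids, grid_pad_mask=None):
        # regions: (B, R, D); grids: (B, G, D);
        # grid_pad_mask: (B, G), True where a grid token is padding.
        attended, _ = self.cross_attn(regions, grids, grids,
                                      key_padding_mask=grid_pad_mask)
        g = torch.sigmoid(self.gate(torch.cat([regions, attended], dim=-1)))
        return g * attended + (1.0 - g) * regions

# Usage: 36 region features and 49 (7x7) grid features per image.
fusion = RegionGridFusion(dim=512)
regions, grids = torch.randn(4, 36, 512), torch.randn(4, 49, 512)
fused = fusion(regions, grids)   # (4, 36, 512)
```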