Latest Articles in IEEE Transactions on Multimedia

Hierarchical Aggregated Graph Neural Network for Skeleton-Based Action Recognition
IF 8.4, CAS Q1, Computer Science
IEEE Transactions on Multimedia Pub Date : 2024-07-15 DOI: 10.1109/TMM.2024.3428330
Pei Geng;Xuequan Lu;Wanqing Li;Lei Lyu
{"title":"Hierarchical Aggregated Graph Neural Network for Skeleton-Based Action Recognition","authors":"Pei Geng;Xuequan Lu;Wanqing Li;Lei Lyu","doi":"10.1109/TMM.2024.3428330","DOIUrl":"10.1109/TMM.2024.3428330","url":null,"abstract":"Supervised human action recognition methods based on skeleton data have achieved impressive performance recently. However, many current works emphasize the design of different contrastive strategies to gain stronger supervised signals, ignoring the crucial role of the model's encoder in encoding fine-grained action representations. Our key insight is that a superior skeleton encoder can effectively exploit the fine-grained dependencies between different skeleton information (e.g., joint, bone, angle) in mining more discriminative fine-grained features. In this paper, we devise an innovative hierarchical aggregated graph neural network (HA-GNN) that involves several core components. In particular, the proposed hierarchical graph convolution (HGC) module learns the complementary semantic information among joint, bone, and angle in a hierarchical manner. The designed pyramid attention fusion mechanism (PAFM) fuses the skeleton features successively to compensate for the action representations obtained by the HGC. We use the multi-scale temporal convolution (MSTC) module to enrich the expression capability of temporal features. In addition, to learn more comprehensive semantic representations of the skeleton, we construct a multi-task learning framework with simple contrastive learning and design the learnable data-enhanced strategy to acquire different data representations. Extensive experiments on NTU RGB+D 60/120, NW-UCLA, Kinetics-400, UAV-Human, and PKUMMD datasets prove that the proposed HA-GNN without contrastive learning achieves state-of-the-art performance in skeleton-based action recognition, and it achieves even better results with contrastive learning.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"11003-11017"},"PeriodicalIF":8.4,"publicationDate":"2024-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141720780","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
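A minimal sketch of the hierarchical aggregation idea named in the HA-GNN abstract above: joint, bone, and angle streams each pass through a simple graph convolution and are then fused step by step. All module names, dimensions, and the fusion order are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class SimpleGraphConv(nn.Module):
    """One graph convolution step: propagate along the skeleton graph, then project."""

    def __init__(self, in_dim, out_dim, num_joints):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)
        # Learnable adjacency initialised to identity (self-connections only).
        self.adj = nn.Parameter(torch.eye(num_joints))

    def forward(self, x):  # x: (batch, num_joints, in_dim)
        x = torch.einsum("vu,bud->bvd", self.adj.softmax(dim=-1), x)
        return torch.relu(self.proj(x))


class HierarchicalAggregation(nn.Module):
    """Fuse joint, bone and angle streams step by step (hypothetical ordering)."""

    def __init__(self, dim=64, num_joints=25):
        super().__init__()
        self.joint_gc = SimpleGraphConv(3, dim, num_joints)
        self.bone_gc = SimpleGraphConv(3, dim, num_joints)
        self.angle_gc = SimpleGraphConv(1, dim, num_joints)
        self.fuse1 = nn.Linear(2 * dim, dim)  # joint + bone
        self.fuse2 = nn.Linear(2 * dim, dim)  # (joint + bone) + angle

    def forward(self, joint, bone, angle):
        j, b, a = self.joint_gc(joint), self.bone_gc(bone), self.angle_gc(angle)
        jb = torch.relu(self.fuse1(torch.cat([j, b], dim=-1)))
        return torch.relu(self.fuse2(torch.cat([jb, a], dim=-1)))


if __name__ == "__main__":
    model = HierarchicalAggregation()
    joint = torch.randn(2, 25, 3)   # (batch, joints, xyz)
    bone = torch.randn(2, 25, 3)
    angle = torch.randn(2, 25, 1)
    print(model(joint, bone, angle).shape)  # torch.Size([2, 25, 64])
```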
LMEye: An Interactive Perception Network for Large Language Models
IF 8.4, CAS Q1, Computer Science
IEEE Transactions on Multimedia Pub Date : 2024-07-15 DOI: 10.1109/TMM.2024.3428317
Yunxin Li;Baotian Hu;Xinyu Chen;Lin Ma;Yong Xu;Min Zhang
{"title":"LMEye: An Interactive Perception Network for Large Language Models","authors":"Yunxin Li;Baotian Hu;Xinyu Chen;Lin Ma;Yong Xu;Min Zhang","doi":"10.1109/TMM.2024.3428317","DOIUrl":"10.1109/TMM.2024.3428317","url":null,"abstract":"Current efficient approaches to building Multimodal Large Language Models (MLLMs) mainly incorporate visual information into LLMs with a simple visual mapping network such as a linear projection layer, a multilayer perceptron (MLP), or Q-former from BLIP-2. Such networks project the image feature once and do not consider the interaction between the image and the human inputs. Hence, the obtained visual information without being connected to human intention may be inadequate for LLMs to generate intention-following responses, which we refer to as static visual information. To alleviate this issue, our paper introduces LMEye, a human-like eye with a play-and-plug interactive perception network, designed to enable dynamic interaction between LLMs and external visual information. It can allow the LLM to request the desired visual information aligned with various human instructions, which we term dynamic visual information acquisition. Specifically, LMEye consists of a simple visual mapping network to provide the basic perception of an image for LLMs. It also contains additional modules responsible for acquiring requests from LLMs, performing request-based visual information seeking, and transmitting the resulting interacted visual information to LLMs, respectively. In this way, LLMs act to understand the human query, deliver the corresponding request to the request-based visual information interaction module, and generate the response based on the interleaved multimodal information. We evaluate LMEye through extensive experiments on multimodal benchmarks, demonstrating that it significantly improves zero-shot performances on various multimodal tasks compared to previous methods, with fewer parameters. Moreover, we also verify its effectiveness and scalability on various language models and video understanding, respectively.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"10952-10964"},"PeriodicalIF":8.4,"publicationDate":"2024-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141720781","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
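A minimal sketch of the "request-based visual information interaction" idea from the LMEye abstract above: an LLM-side request vector attends over image features, and the attended result is projected back into the LLM embedding space. Module names, toy dimensions, and the single-token request are assumptions for illustration only, not the paper's architecture.

```python
import torch
import torch.nn as nn


class InteractivePerception(nn.Module):
    def __init__(self, vis_dim=768, llm_dim=1024, n_heads=8):
        super().__init__()
        self.static_map = nn.Linear(vis_dim, llm_dim)      # basic image perception for the LLM
        self.vis_to_llm = nn.Linear(vis_dim, llm_dim)
        self.request_attn = nn.MultiheadAttention(llm_dim, n_heads, batch_first=True)
        self.out_proj = nn.Linear(llm_dim, llm_dim)

    def forward(self, image_feats, request):
        # image_feats: (B, num_patches, vis_dim); request: (B, 1, llm_dim),
        # e.g. the LLM hidden state of a special "request" token.
        static_tokens = self.static_map(image_feats)        # always-on (static) visual tokens
        kv = self.vis_to_llm(image_feats)
        interacted, _ = self.request_attn(request, kv, kv)  # request-conditioned visual lookup
        return static_tokens, self.out_proj(interacted)     # both would be fed back to the LLM


if __name__ == "__main__":
    module = InteractivePerception()
    feats = torch.randn(2, 257, 768)         # e.g. ViT patch features (toy sizes)
    req = torch.randn(2, 1, 1024)            # hypothetical LLM request state
    static_tok, dyn_tok = module(feats, req)
    print(static_tok.shape, dyn_tok.shape)   # (2, 257, 1024) (2, 1, 1024)
```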
AnimeDiff: Customized Image Generation of Anime Characters Using Diffusion Model
IF 8.4, CAS Q1, Computer Science
IEEE Transactions on Multimedia Pub Date : 2024-07-08 DOI: 10.1109/TMM.2024.3415357
Yuqi Jiang;Qiankun Liu;Dongdong Chen;Lu Yuan;Ying Fu
{"title":"AnimeDiff: Customized Image Generation of Anime Characters Using Diffusion Model","authors":"Yuqi Jiang;Qiankun Liu;Dongdong Chen;Lu Yuan;Ying Fu","doi":"10.1109/TMM.2024.3415357","DOIUrl":"10.1109/TMM.2024.3415357","url":null,"abstract":"Due to the unprecedented power of text-to-image diffusion models, customizing these models to generate new concepts has gained increasing attention. Existing works have achieved some success on real-world concepts, but fail on the concepts of anime characters. We empirically find that such low quality comes from the newly introduced identifier text tokens, which are optimized to identify different characters. In this paper, we propose \u0000<italic>AnimeDiff</i>\u0000 which focuses on customized image generation of anime characters. Our AnimeDiff directly binds anime characters with their names and keeps the embeddings of text tokens unchanged. Furthermore, when composing multiple characters in a single image, the model tends to confuse the properties of those characters. To address this issue, our AnimeDiff incorporates a \u0000<italic>Cut-and-Paste</i>\u0000 data augmentation strategy that produces multi-character images for training by cutting and pasting multiple characters onto background images. Experiments are conducted to prove the superiority of AnimeDiff over other methods.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"10559-10572"},"PeriodicalIF":8.4,"publicationDate":"2024-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141568317","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
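A minimal sketch of a Cut-and-Paste style augmentation like the one described in the AnimeDiff abstract above: paste several pre-segmented character cut-outs (RGBA, with transparent backgrounds) onto a background image to build multi-character training samples. The file paths, scale range, and placement rule are illustrative assumptions, not the authors' pipeline.

```python
import random
from PIL import Image


def cut_and_paste(background_path, character_paths, scale_range=(0.3, 0.6), seed=None):
    rng = random.Random(seed)
    canvas = Image.open(background_path).convert("RGB")
    bw, bh = canvas.size
    for path in character_paths:
        char = Image.open(path).convert("RGBA")            # alpha channel acts as the "cut" mask
        scale = rng.uniform(*scale_range)
        new_h = int(bh * scale)
        new_w = int(char.width * new_h / char.height)      # keep the character's aspect ratio
        char = char.resize((new_w, new_h))
        # Random placement, keeping the character fully inside the canvas.
        x = rng.randint(0, max(0, bw - new_w))
        y = rng.randint(0, max(0, bh - new_h))
        canvas.paste(char, (x, y), mask=char)              # alpha-composited paste
    return canvas


if __name__ == "__main__":
    # Hypothetical files: one background plus two pre-segmented character crops.
    out = cut_and_paste("background.jpg", ["char_a.png", "char_b.png"], seed=0)
    out.save("multi_character_sample.jpg")
```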
Toward Efficient Video Compression Artifact Detection and Removal: A Benchmark Dataset
IF 8.4, CAS Q1, Computer Science
IEEE Transactions on Multimedia Pub Date : 2024-07-03 DOI: 10.1109/TMM.2024.3414549
Liqun Lin;Mingxing Wang;Jing Yang;Keke Zhang;Tiesong Zhao
{"title":"Toward Efficient Video Compression Artifact Detection and Removal: A Benchmark Dataset","authors":"Liqun Lin;Mingxing Wang;Jing Yang;Keke Zhang;Tiesong Zhao","doi":"10.1109/TMM.2024.3414549","DOIUrl":"10.1109/TMM.2024.3414549","url":null,"abstract":"Video compression leads to compression artifacts, among which Perceivable Encoding Artifacts (PEAs) degrade user perception. Most of existing state-of-the-art Video Compression Artifact Removal (VCAR) methods indiscriminately process all artifacts, thus leading to over-enhancement in non-PEA regions. Therefore, accurate detection and location of PEAs is crucial. In this paper, we propose the largest-ever Fine-grained PEA database (FPEA). First, we employ the popular video codecs, VVC and AVS3, as well as their common test settings, to generate four types of spatial PEAs (blurring, blocking, ringing and color bleeding) and two types of temporal PEAs (flickering and floating). Second, we design a labeling platform and recruit sufficient subjects to manually locate all the above types of PEAs. Third, we propose a voting mechanism and feature matching to synthesize all subjective labels to obtain the final PEA labels with fine-grained locations. Besides, we also provide Mean Opinion Score (MOS) values of all compressed video sequences. Experimental results show the effectiveness of FPEA database on both VCAR and compressed Video Quality Assessment (VQA). We envision that FPEA database will benefit the future development of VCAR, VQA and perception-aware video encoders. The FPEA database has been made publicly available.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"10816-10827"},"PeriodicalIF":8.4,"publicationDate":"2024-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141549758","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
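A minimal sketch of a voting-style label aggregation, in the spirit of the FPEA labeling pipeline described above: several subjects mark where they perceive an artifact, and a region is kept only if enough subjects agree. The mask layout, vote threshold, and per-pixel granularity are assumptions for illustration; the paper additionally uses feature matching, which is not shown here.

```python
import numpy as np


def aggregate_votes(subject_masks, min_votes=3):
    """subject_masks: (num_subjects, H, W) binary maps for one artifact type."""
    votes = np.sum(subject_masks, axis=0)          # per-pixel agreement count
    return (votes >= min_votes).astype(np.uint8)   # fine-grained consensus mask


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Five hypothetical subjects labelling a 64x64 frame region.
    masks = (rng.random((5, 64, 64)) > 0.7).astype(np.uint8)
    consensus = aggregate_votes(masks, min_votes=3)
    print("pixels kept:", int(consensus.sum()))
```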
Human-Centric Behavior Description in Videos: New Benchmark and Model
IF 8.4, CAS Q1, Computer Science
IEEE Transactions on Multimedia Pub Date : 2024-07-02 DOI: 10.1109/TMM.2024.3414263
Lingru Zhou;Yiqi Gao;Manqing Zhang;Peng Wu;Peng Wang;Yanning Zhang
{"title":"Human-Centric Behavior Description in Videos: New Benchmark and Model","authors":"Lingru Zhou;Yiqi Gao;Manqing Zhang;Peng Wu;Peng Wang;Yanning Zhang","doi":"10.1109/TMM.2024.3414263","DOIUrl":"10.1109/TMM.2024.3414263","url":null,"abstract":"In the domain of video surveillance, describing the behavior of each individual within the video is becoming increasingly essential, especially in complex scenarios with multiple individuals present. This is because describing each individual's behavior provides more detailed situational analysis, enabling accurate assessment and response to potential risks, ensuring the safety and harmony of public places. Currently, video-level captioning datasets cannot provide fine-grained descriptions for each individual's specific behavior. However, mere descriptions at the video-level fail to provide an in-depth interpretation of individual behaviors, making it challenging to accurately determine the specific identity of each individual. To address this challenge, we construct a human-centric video surveillance captioning dataset, which provides detailed descriptions of the dynamic behaviors of 7,820 individuals. Specifically, we have labeled several aspects of each person, such as location, clothing, and interactions with other elements in the scene, and these people are distributed across 1,012 videos. Based on this dataset, we can link individuals to their respective behaviors, allowing for further analysis of each person's behavior in surveillance videos. Besides the dataset, we propose a novel video captioning approach that can describe individual behavior in detail on a person-level basis, achieving state-of-the-art results.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"10867-10878"},"PeriodicalIF":8.4,"publicationDate":"2024-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141531141","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Efficient Cross-Modal Video Retrieval With Meta-Optimized Frames
IF 8.4, CAS Q1, Computer Science
IEEE Transactions on Multimedia Pub Date : 2024-06-28 DOI: 10.1109/TMM.2024.3416669
Ning Han;Xun Yang;Ee-Peng Lim;Hao Chen;Qianru Sun
{"title":"Efficient Cross-Modal Video Retrieval With Meta-Optimized Frames","authors":"Ning Han;Xun Yang;Ee-Peng Lim;Hao Chen;Qianru Sun","doi":"10.1109/TMM.2024.3416669","DOIUrl":"10.1109/TMM.2024.3416669","url":null,"abstract":"Cross-modal video retrieval aims to retrieve semantically relevant videos when given a textual query, and is one of the fundamental multimedia tasks. Most top-performing methods primarily leverage Vision Transformer (ViT) to extract video features (Lei et al., 2021}, (Bain et al., 2021), (Wang et al., 2022). However, they suffer from the high computational complexity of ViT, especially when encoding long videos. A common and simple solution is to uniformly sample a small number (e.g., 4 or 8) of frames from the target video (instead of using the whole video) as ViT inputs. The number of frames has a strong influence on the performance of ViT, e.g., using 8 frames yields better performance than using 4 frames but requires more computational resources, resulting in a trade-off. To get free from this trade-off, this paper introduces an automatic video compression method based on a bilevel optimization program (BOP) consisting of both model-level (i.e., base-level) and frame-level (i.e., meta-level) optimizations. The model-level optimization process learns a cross-modal video retrieval model whose input includes the “compressed frames” learned by frame-level optimization. In turn, frame-level optimization is achieved through gradient descent using the meta loss of the video retrieval model computed on the whole video. We call this BOP method (as well as the “compressed frames”) the Meta-Optimized Frames (MOF) approach. By incorporating MOF, the video retrieval model is able to utilize the information of whole videos (for training) while taking only a small number of input frames in its actual implementation. The convergence of MOF is guaranteed by meta gradient descent algorithms. For evaluation purposes, we conduct extensive cross-modal video retrieval experiments on three large-scale benchmarks: MSR-VTT, MSVD, and DiDeMo. Our results show that MOF is a generic and efficient method that boost multiple baseline methods, and can achieve a new state-of-the-art performance.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"10924-10936"},"PeriodicalIF":8.4,"publicationDate":"2024-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141504209","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
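A minimal sketch of a bilevel loop in the spirit of the MOF idea above: "compressed frames" are produced by learnable mixing weights over all video frames (meta-level), while a toy retrieval model is trained on those compressed frames (base-level). The toy losses, shapes, and update schedule are assumptions, not the authors' algorithm.

```python
import torch
import torch.nn as nn

N_FRAMES, K_COMP, DIM = 16, 4, 128

# Base-level: a toy video/text retrieval model applied to per-frame features.
video_enc = nn.Sequential(nn.Linear(DIM, DIM), nn.ReLU(), nn.Linear(DIM, DIM))
# Meta-level: mixing weights that compress N frames into K "compressed frames".
mix_logits = nn.Parameter(torch.zeros(K_COMP, N_FRAMES))

base_opt = torch.optim.Adam(video_enc.parameters(), lr=1e-3)
meta_opt = torch.optim.Adam([mix_logits], lr=1e-2)


def retrieval_loss(frame_feats, text_feat):
    """Toy loss: pull the pooled video feature toward its paired text feature."""
    video_feat = video_enc(frame_feats).mean(dim=1)
    return (1 - torch.cosine_similarity(video_feat, text_feat)).mean()


for step in range(100):
    frames = torch.randn(8, N_FRAMES, DIM)   # stand-in for whole-video frame features
    text = torch.randn(8, DIM)               # stand-in for paired text features

    # Base step: train the retrieval model on the (detached) compressed frames only.
    comp = torch.einsum("kn,bnd->bkd", mix_logits.softmax(dim=-1).detach(), frames)
    base_opt.zero_grad()
    retrieval_loss(comp, text).backward()
    base_opt.step()

    # Meta step: update the mixing weights so the compressed frames both retrieve
    # well and stay consistent with the whole-video representation.
    comp = torch.einsum("kn,bnd->bkd", mix_logits.softmax(dim=-1), frames)
    full_feat = video_enc(frames).mean(dim=1).detach()   # whole-video reference
    comp_feat = video_enc(comp).mean(dim=1)
    meta_loss = retrieval_loss(comp, text) + (comp_feat - full_feat).pow(2).mean()
    meta_opt.zero_grad()
    meta_loss.backward()
    meta_opt.step()
```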
Multimodal Progressive Modulation Network for Micro-Video Multi-Label Classification
IF 8.4, CAS Q1, Computer Science
IEEE Transactions on Multimedia Pub Date : 2024-06-26 DOI: 10.1109/TMM.2024.3405724
Peiguang Jing;Xuan Zhao;Fugui Fan;Fan Yang;Yun Li;Yuting Su
{"title":"Multimodal Progressive Modulation Network for Micro-Video Multi-Label Classification","authors":"Peiguang Jing;Xuan Zhao;Fugui Fan;Fan Yang;Yun Li;Yuting Su","doi":"10.1109/TMM.2024.3405724","DOIUrl":"10.1109/TMM.2024.3405724","url":null,"abstract":"Micro-videos, as an increasingly popular form of user-generated content (UGC), naturally include diverse multimodal cues. However, in pursuit of consistent representations, existing methods neglect the simultaneous consideration of exploring modality discrepancy and preserving modality diversity. In this paper, we propose a multimodal progressive modulation network (MPMNet) for micro-video multi-label classification, which enhances the indicative ability of each modality through gradually regulating various modality biases. In MPMNet, we first leverage a unimodal-centered parallel aggregation strategy to obtain preliminary comprehensive representations. We then integrate feature-domain disentangled modulation process and category-domain adaptive modulation process into a unified framework to jointly refine modality-oriented representations. In the former modulation process, we constrain inter-modal dependencies in a latent space to obtain modality-oriented sample representations, and introduce a disentangled paradigm to further maintain modality diversity. In the latter modulation process, we construct global-context-aware graph convolutional networks to acquire modality-oriented category representations, and develop two instance-level parameter generators to further regulate unimodal semantic biases. Extensive experiments on two micro-video multi-label datasets show that our proposed approach outperforms the state-of-the-art methods.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"10134-10144"},"PeriodicalIF":8.4,"publicationDate":"2024-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141528982","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
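A minimal sketch loosely related to the category-domain modulation described in the MPMNet abstract above: a simple graph convolution over a label co-occurrence adjacency produces category representations, which then score multi-label outputs against a sample feature. The adjacency construction, two-layer depth, and dimensions are illustrative assumptions, not the paper's design.

```python
import torch
import torch.nn as nn


class CategoryGCN(nn.Module):
    def __init__(self, num_labels, label_dim, feat_dim):
        super().__init__()
        self.label_emb = nn.Parameter(torch.randn(num_labels, label_dim))
        self.gc1 = nn.Linear(label_dim, feat_dim)
        self.gc2 = nn.Linear(feat_dim, feat_dim)

    def forward(self, sample_feat, adj):
        # adj: (num_labels, num_labels) row-normalised co-occurrence matrix.
        h = torch.relu(self.gc1(adj @ self.label_emb))
        cat_repr = self.gc2(adj @ h)                      # (num_labels, feat_dim)
        return sample_feat @ cat_repr.t()                 # multi-label logits


if __name__ == "__main__":
    num_labels = 20
    co_occur = torch.rand(num_labels, num_labels)
    adj = co_occur / co_occur.sum(dim=1, keepdim=True)    # simple row normalisation
    model = CategoryGCN(num_labels, label_dim=64, feat_dim=256)
    feats = torch.randn(8, 256)                           # stand-in fused micro-video features
    print(model(feats, adj).shape)                        # torch.Size([8, 20])
```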
Alleviating Over-Fitting in Hashing-Based Fine-Grained Image Retrieval: From Causal Feature Learning to Binary-Injected Hash Learning
IF 8.4, CAS Q1, Computer Science
IEEE Transactions on Multimedia Pub Date : 2024-06-21 DOI: 10.1109/TMM.2024.3410136
Xinguang Xiang;Xinhao Ding;Lu Jin;Zechao Li;Jinhui Tang;Ramesh Jain
{"title":"Alleviating Over-Fitting in Hashing-Based Fine-Grained Image Retrieval: From Causal Feature Learning to Binary-Injected Hash Learning","authors":"Xinguang Xiang;Xinhao Ding;Lu Jin;Zechao Li;Jinhui Tang;Ramesh Jain","doi":"10.1109/TMM.2024.3410136","DOIUrl":"10.1109/TMM.2024.3410136","url":null,"abstract":"Hashing-based fine-grained image retrieval pursues learning diverse local features to generate inter-class discriminative hash codes. However, existing fine-grained hash methods with attention mechanisms usually tend to just focus on a few obvious areas, which misguides the network to over-fit some salient features. Such a problem raises two main limitations. 1) It overlooks some subtle local features, degrading the generalization capability of learned embedding. 2) It causes the over-activation of some hash bits correlated to salient features, which breaks the binary code balance and further weakens the discrimination abilities of hash codes. To address these limitations of the over-fitting problem, we propose a novel hash framework from \u0000<bold>C</b>\u0000ausal \u0000<bold>F</b>\u0000eature learning to \u0000<bold>B</b>\u0000inary-injected \u0000<bold>H</b>\u0000ash learning (\u0000<bold>CFBH</b>\u0000), which captures various local information and suppresses over-activated hash bits simultaneously. For causal feature learning, we adopt causal inference theory to alleviate the bias towards the salient regions in fine-grained images. In detail, we obtain local features from the feature map and combine this local information with original image information followed by this theory. Theoretically, these fused embeddings help the network to re-weight the retrieval effort of each local feature and exploit more subtle variations without observational bias. For binary-injected hash learning, we propose a Binary Noise Injection (BNI) module inspired by Dropout. The BNI module not only mitigates over-activation to particular bits, but also makes hash codes uncorrelated and balanced in the Hamming space. Extensive experimental results on six popular fine-grained image datasets demonstrate the superiority of CFBH over several State-of-the-Art methods.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"10665-10677"},"PeriodicalIF":8.4,"publicationDate":"2024-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141504210","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
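A minimal sketch of a Dropout-style binary noise injection applied to hash logits, in the spirit of the BNI module described in the CFBH abstract above. The specific noise form (randomly flipping the sign of a fraction of bits during training) is an assumption for illustration, not the authors' exact design.

```python
import torch
import torch.nn as nn


class BinaryNoiseInjection(nn.Module):
    def __init__(self, flip_prob=0.1):
        super().__init__()
        self.flip_prob = flip_prob

    def forward(self, hash_logits):                 # (batch, num_bits), pre-sign values
        if not self.training:
            return hash_logits
        flip = (torch.rand_like(hash_logits) < self.flip_prob).float()
        # Flip the sign of a random subset of bits so no single bit can be
        # relied on (over-activated) for every salient feature.
        return hash_logits * (1.0 - 2.0 * flip)


if __name__ == "__main__":
    bni = BinaryNoiseInjection(flip_prob=0.2)
    logits = torch.randn(4, 48)                     # e.g. 48-bit hash logits
    bni.train()
    noisy = bni(logits)                             # noisy logits used during training
    codes = torch.sign(bni.eval()(logits))          # clean binary codes at test time
    print(noisy.shape, codes[0, :8])
```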
Relation-Aware Weight Sharing in Decoupling Feature Learning Network for UAV RGB-Infrared Vehicle Re-Identification
IF 8.4, CAS Q1, Computer Science
IEEE Transactions on Multimedia Pub Date : 2024-06-21 DOI: 10.1109/TMM.2024.3400675
Xingyue Liu;Jiahao Qi;Chen Chen;Kangcheng Bin;Ping Zhong
{"title":"Relation-Aware Weight Sharing in Decoupling Feature Learning Network for UAV RGB-Infrared Vehicle Re-Identification","authors":"Xingyue Liu;Jiahao Qi;Chen Chen;Kangcheng Bin;Ping Zhong","doi":"10.1109/TMM.2024.3400675","DOIUrl":"10.1109/TMM.2024.3400675","url":null,"abstract":"Owing to the capacity of performing full-time target searches, cross-modality vehicle re-identification based on unmanned aerial vehicles (UAV) is gaining more attention in both video surveillance and public security. However, this promising and innovative research has not been studied sufficiently due to the issue of data inadequacy. Meanwhile, the cross-modality discrepancy and orientation discrepancy challenges further aggravate the difficulty of this task. To this end, we pioneer a cross-modality vehicle Re-ID benchmark named UAV Cross-Modality Vehicle Re-ID (UCM-VeID), containing 753 identities with \u0000<bold>16015</b>\u0000 RGB and \u0000<bold>13913</b>\u0000 infrared images. Moreover, to meet cross-modality discrepancy and orientation discrepancy challenges, we present a hybrid weights decoupling network (HWDNet) to learn the shared discriminative orientation-invariant features. For the first challenge, we proposed a hybrid weights siamese network with a well-designed weight restrainer and its corresponding objective function to learn both modality-specific and modality shared information. In terms of the second challenge, three effective decoupling structures with two pretext tasks are investigated to flexibly conduct orientation-invariant feature separation task. Comprehensive experiments are carried out to validate the effectiveness of the proposed method.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"9839-9853"},"PeriodicalIF":8.4,"publicationDate":"2024-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141517081","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
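A minimal sketch of a "hybrid weights" siamese encoder: modality-specific stems plus a fully shared trunk, with a simple weight-restraint term that keeps the two stems from drifting arbitrarily far apart. This only illustrates the idea named in the HWDNet abstract above; the actual restrainer, objective, and architecture in the paper may differ.

```python
import torch
import torch.nn as nn


class HybridWeightsSiamese(nn.Module):
    def __init__(self, feat_dim=256):
        super().__init__()
        self.rgb_stem = nn.Conv2d(3, 32, kernel_size=3, padding=1)   # modality-specific
        self.ir_stem = nn.Conv2d(3, 32, kernel_size=3, padding=1)    # modality-specific
        self.shared = nn.Sequential(                                  # modality-shared trunk
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, feat_dim))

    def forward(self, rgb, ir):
        f_rgb = self.shared(torch.relu(self.rgb_stem(rgb)))
        f_ir = self.shared(torch.relu(self.ir_stem(ir)))
        return f_rgb, f_ir

    def weight_restraint(self):
        # Penalise divergence between the specific stems (hypothetical restrainer).
        return sum((p - q).pow(2).mean()
                   for p, q in zip(self.rgb_stem.parameters(), self.ir_stem.parameters()))


if __name__ == "__main__":
    model = HybridWeightsSiamese()
    rgb = torch.randn(2, 3, 128, 128)
    ir = torch.randn(2, 3, 128, 128)     # infrared replicated to 3 channels for the toy stem
    f_rgb, f_ir = model(rgb, ir)
    print(f_rgb.shape, model.weight_restraint().item())
```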
Align and Retrieve: Composition and Decomposition Learning in Image Retrieval With Text Feedback
IF 8.4, CAS Q1, Computer Science
IEEE Transactions on Multimedia Pub Date : 2024-06-21 DOI: 10.1109/TMM.2024.3417694
Yahui Xu;Yi Bin;Jiwei Wei;Yang Yang;Guoqing Wang;Heng Tao Shen
{"title":"Align and Retrieve: Composition and Decomposition Learning in Image Retrieval With Text Feedback","authors":"Yahui Xu;Yi Bin;Jiwei Wei;Yang Yang;Guoqing Wang;Heng Tao Shen","doi":"10.1109/TMM.2024.3417694","DOIUrl":"10.1109/TMM.2024.3417694","url":null,"abstract":"We study the task of image retrieval with text feedback, where a reference image and modification text are composed to retrieve the desired target image. To accomplish this goal, existing methods always get the multimodal representations through different feature encoders and then adopt different strategies to model the correlation between the composed inputs and the target image. However, the multimodal query brings more challenges as it requires not only the synergistic understanding of the semantics from the heterogeneous multimodal inputs but also the ability to accurately build the underlying semantic correlation existing in each inputs-target triplet, i.e., reference image, modification text, and target image. In this paper, we tackle these issues with a novel Align and Retrieve (AlRet) framework. First, our proposed methods employ the contrastive loss in the feature encoders to learn meaningful multimodal representation while making the subsequent correlation modeling process in a more harmonious space. Then we propose to learn the accurate correlation between the composed inputs and target image in a novel composition-and-decomposition paradigm. Specifically, the composition network couples the reference image and modification text into a joint representation to learn the correlation between the joint representation and target image. The decomposition network conversely decouples the target image into visual and text subspaces to exploit the underlying correlation between the target image with each query element. The composition-and-decomposition paradigm forms a closed loop, which can be optimized simultaneously to promote each other in the performance. Massive comparison experiments on three real-world datasets confirm the effectiveness of the proposed method.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"9936-9948"},"PeriodicalIF":8.4,"publicationDate":"2024-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141504211","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
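A minimal sketch of the composition-and-decomposition pairing described in the AlRet abstract above: a composition head fuses the reference image and modification text into one query, while a decomposition head projects the target image back into visual and textual subspaces, and the two directions are trained together. Fusion choices, the cosine-based toy losses, and all dimensions are assumptions for illustration.

```python
import torch
import torch.nn as nn


class ComposeDecompose(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.compose = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.to_visual = nn.Linear(dim, dim)     # target -> visual subspace
        self.to_text = nn.Linear(dim, dim)       # target -> text subspace

    def forward(self, ref_img, mod_text, tgt_img):
        query = self.compose(torch.cat([ref_img, mod_text], dim=-1))
        tgt_vis, tgt_txt = self.to_visual(tgt_img), self.to_text(tgt_img)
        # Closed loop: the composed query should match the target, and the
        # decomposed target should match each query element.
        loss = ((1 - torch.cosine_similarity(query, tgt_img)).mean()
                + (1 - torch.cosine_similarity(tgt_vis, ref_img)).mean()
                + (1 - torch.cosine_similarity(tgt_txt, mod_text)).mean())
        return query, loss


if __name__ == "__main__":
    model = ComposeDecompose()
    ref, txt, tgt = (torch.randn(4, 512) for _ in range(3))   # stand-in encoder outputs
    query, loss = model(ref, txt, tgt)
    print(query.shape, loss.item())
```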