IEEE Transactions on Pattern Analysis and Machine Intelligence: Latest Articles

Stimulating Diffusion Model for Image Denoising via Adaptive Embedding and Ensembling
IEEE Transactions on Pattern Analysis and Machine Intelligence. Pub Date: 2024-07-23. DOI: 10.1109/TPAMI.2024.3432812
Authors: Tong Li, Hansen Feng, Lizhi Wang, Lin Zhu, Zhiwei Xiong, Hua Huang

Abstract: Image denoising is a fundamental problem in computational photography, where achieving high perceptual quality with low distortion is highly demanding. Current methods either struggle with perceptual quality or suffer from significant distortion. Recently, the emerging diffusion model has achieved state-of-the-art performance in various tasks and demonstrates great potential for image denoising. However, stimulating diffusion models for image denoising is not straightforward and requires solving several critical problems. For one thing, the input inconsistency hinders the connection between diffusion models and image denoising. For another, the content inconsistency between the generated image and the desired denoised image introduces distortion. To tackle these problems, we present a novel strategy called the Diffusion Model for Image Denoising (DMID) by understanding and rethinking the diffusion model from a denoising perspective. Our DMID strategy includes an adaptive embedding method that embeds the noisy image into a pre-trained unconditional diffusion model and an adaptive ensembling method that reduces distortion in the denoised image. Our DMID strategy achieves state-of-the-art performance on both distortion-based and perception-based metrics, for both Gaussian and real-world image denoising. The code is available at https://github.com/Li-Tong-621/DMID.

Citations: 0
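A minimal sketch of how adaptive embedding and ensembling could look, assuming a standard DDPM linear noise schedule and a placeholder `reverse_diffuse` sampler; the timestep-matching rule is an illustrative assumption, not the authors' released DMID implementation (see their repository for that).

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # standard DDPM linear schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)    # cumulative signal retention

def matching_timestep(sigma: float) -> int:
    """Pick t so the diffusion noise level at t matches the input noise std.
    For x_t = sqrt(a_t)*x0 + sqrt(1-a_t)*eps, the noise std relative to the
    signal is sqrt((1-a_t)/a_t)."""
    eff = torch.sqrt((1 - alpha_bar) / alpha_bar)
    return int(torch.argmin((eff - sigma).abs()))

def reverse_diffuse(x_t: torch.Tensor, t: int) -> torch.Tensor:
    """Placeholder for a pre-trained unconditional diffusion sampler run from
    timestep t down to 0. Identity here so the sketch executes."""
    return x_t

def dmid_denoise(noisy: torch.Tensor, sigma: float, n_ensemble: int = 4):
    t = matching_timestep(sigma)
    x_t = torch.sqrt(alpha_bar[t]) * noisy       # embed: rescale so x_t fits q(x_t)
    outs = [reverse_diffuse(x_t, t) for _ in range(n_ensemble)]
    return torch.stack(outs).mean(dim=0)         # ensemble to cut distortion

x = dmid_denoise(torch.rand(1, 3, 64, 64), sigma=0.2)
```

Averaging several stochastic reverse trajectories is one way ensembling can reduce distortion, at some cost in per-sample perceptual variety.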
DifFace: Blind Face Restoration with Diffused Error Contraction
IEEE Transactions on Pattern Analysis and Machine Intelligence. Pub Date: 2024-07-23. DOI: 10.1109/TPAMI.2024.3432651
Authors: Zongsheng Yue, Chen Change Loy

Abstract: While deep learning-based methods for blind face restoration have achieved unprecedented success, they still suffer from two major limitations. First, most of them deteriorate when facing complex degradations outside their training data. Second, these methods require multiple constraints, e.g., fidelity, perceptual, and adversarial losses, which require laborious hyper-parameter tuning to stabilize and balance their influence. In this work, we propose a novel method named DifFace that copes with unseen and complex degradations more gracefully, without complicated loss designs. The key to our method is to establish a posterior distribution from the observed low-quality (LQ) image to its high-quality (HQ) counterpart. In particular, we design a transition distribution from the LQ image to the intermediate state of a pre-trained diffusion model, and then gradually transmit from this intermediate state to the HQ target by recursively applying the pre-trained diffusion model. The transition distribution only relies on a restoration backbone trained with an L1 loss on some synthetic data, which favorably avoids the cumbersome training process of existing methods. Moreover, the transition distribution can contract the error of the restoration backbone, making our method more robust to unknown degradations. Comprehensive experiments show that DifFace is superior to current state-of-the-art methods, especially in cases with severe degradations. Code and model are available at https://github.com/zsyOAOA/DifFace.

Citations: 0
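The transition-distribution step admits a similarly small sketch. Here `restoration_backbone`, `reverse_from`, and the starting step `start_t` are placeholders standing in for the trained restorer and the pre-trained diffusion sampler; this illustrates the idea, not the released DifFace code.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

def restoration_backbone(lq: torch.Tensor) -> torch.Tensor:
    """Stand-in for the L1-trained restorer f(y); identity keeps it runnable."""
    return lq

def diffuse_to(x0: torch.Tensor, t: int) -> torch.Tensor:
    """Sample x_t ~ q(x_t | x0): the transition to the intermediate state."""
    eps = torch.randn_like(x0)
    return torch.sqrt(alpha_bar[t]) * x0 + torch.sqrt(1 - alpha_bar[t]) * eps

def reverse_from(x_t: torch.Tensor, t: int) -> torch.Tensor:
    """Placeholder for recursively applying the pre-trained diffusion model
    from step t down to 0."""
    return x_t

def difface_restore(lq: torch.Tensor, start_t: int = 400) -> torch.Tensor:
    x0_hat = restoration_backbone(lq)   # coarse estimate; its error is
    x_t = diffuse_to(x0_hat, start_t)   # contracted by the injected noise
    return reverse_from(x_t, start_t)

hq = difface_restore(torch.rand(1, 3, 128, 128))
```

Starting the reverse process at an intermediate step rather than pure noise is what lets the added noise drown out, and thereby contract, the backbone's residual error.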
Learning to Cut via Hierarchical Sequence/Set Model for Efficient Mixed-Integer Programming
IEEE Transactions on Pattern Analysis and Machine Intelligence. Pub Date: 2024-07-23. DOI: 10.1109/TPAMI.2024.3432716
Authors: Jie Wang, Zhihai Wang, Xijun Li, Yufei Kuang, Zhihao Shi, Fangzhou Zhu, Mingxuan Yuan, Jia Zeng, Yongdong Zhang, Feng Wu

Abstract: Cutting planes (cuts) play an important role in solving mixed-integer linear programs (MILPs), which formulate many important real-world applications. Cut selection heavily depends on (P1) which cuts to prefer and (P2) how many cuts to select. Although modern MILP solvers tackle (P1)-(P2) with human-designed heuristics, machine learning carries the potential to learn more effective heuristics. However, many existing learning-based methods learn which cuts to prefer, neglecting the importance of learning how many cuts to select. Moreover, we observe that (P3) what order of selected cuts to prefer also significantly impacts the efficiency of MILP solvers. To address these challenges, we propose a novel hierarchical sequence/set model (HEM) to learn cut-selection policies. Specifically, HEM is a bi-level model: (1) a higher-level module that learns how many cuts to select, and (2) a lower-level module, which formulates cut selection as a sequence/set-to-sequence learning problem, that learns policies selecting an ordered subset with the cardinality determined by the higher-level module. To the best of our knowledge, HEM is the first data-driven methodology that tackles (P1)-(P3) simultaneously. Experiments demonstrate that HEM significantly improves the efficiency of solving MILPs on eleven challenging MILP benchmarks, including two real problems from Huawei.

Citations: 0
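One plausible shape for such a bi-level policy, sketched under heavy assumptions: the layer sizes are invented, and greedy score sorting stands in for HEM's sequence/set-to-sequence decoder, so this illustrates only the division of labor between the two levels.

```python
import torch
import torch.nn as nn

class BiLevelCutPolicy(nn.Module):
    def __init__(self, feat_dim: int = 14, hidden: int = 64):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.ratio_head = nn.Sequential(nn.Linear(hidden, 1), nn.Sigmoid())  # higher level: how many
        self.score_head = nn.Linear(hidden, 1)                               # lower level: which/order

    def forward(self, cut_feats: torch.Tensor):
        # cut_feats: (1, n_cuts, feat_dim) -- one bag of candidate cuts
        enc, (h, _) = self.encoder(cut_feats)
        ratio = self.ratio_head(h[-1]).squeeze()          # fraction of cuts to keep
        k = max(1, int(ratio.item() * cut_feats.size(1)))
        scores = self.score_head(enc).squeeze(-1)         # (1, n_cuts)
        order = torch.argsort(scores, dim=1, descending=True)
        return order[0, :k], ratio                        # ordered subset of size k

policy = BiLevelCutPolicy()
chosen, ratio = policy(torch.randn(1, 30, 14))
```

The key design point the abstract emphasizes is that the subset's cardinality is itself predicted, rather than fixed by a hand-tuned quota.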
Inductive State-Relabeling Adversarial Active Learning with Heuristic Clique Rescaling
IEEE Transactions on Pattern Analysis and Machine Intelligence. Pub Date: 2024-07-23. DOI: 10.1109/TPAMI.2024.3432099
Authors: Beichen Zhang, Liang Li, Shuhui Wang, Shaofei Cai, Zheng-Jun Zha, Qi Tian, Qingming Huang

Abstract: Active learning (AL) aims to design label-efficient algorithms by labeling the most representative samples. It reduces annotation cost and has attracted increasing attention from the community. However, previous AL methods suffer from the inadequacy of annotations and unreliable uncertainty estimation. Moreover, we find that they ignore the intra-diversity of selected samples, which leads to sampling redundancy. In view of these challenges, we propose an inductive state-relabeling adversarial AL model (ISRA) that consists of a unified representation generator, an inductive state-relabeling discriminator, and a heuristic clique rescaling module. The generator introduces contrastive learning to leverage unlabeled samples for self-supervised training, where mutual information is utilized to improve the representation quality for AL selection. Then, we design an inductive uncertainty indicator to learn the state score from labeled data and relabel unlabeled data with different importance for better discrimination of instructive samples. To solve the problem of sampling redundancy, the heuristic clique rescaling module measures the intra-diversity of candidate samples and recurrently rescales them to select the most informative samples. Experiments conducted on eight datasets and two imbalanced scenarios show that our model outperforms previous state-of-the-art AL methods. As an extension to the cross-modal AL task, we apply ISRA to image captioning, where it also achieves superior performance.

Citations: 0
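A toy version of score-plus-diversity selection, with a greedy similarity penalty standing in for the heuristic clique rescaling module; the state scores are assumed to come from the (not shown) state-relabeling discriminator.

```python
import torch

def select_batch(feats: torch.Tensor, state_scores: torch.Tensor, budget: int):
    """feats: (n, d) unlabeled representations; state_scores: (n,), higher =
    more informative. Greedy pick balancing score and intra-diversity (a
    stand-in for heuristic clique rescaling, which recurrently rescales
    candidates instead)."""
    feats = torch.nn.functional.normalize(feats, dim=1)
    picked = []
    scores = state_scores.clone()
    for _ in range(budget):
        i = int(torch.argmax(scores))
        picked.append(i)
        scores[i] = -float("inf")
        redundancy = feats @ feats[i]          # cosine similarity to the pick
        scores = scores - redundancy           # down-weight near-duplicates
    return picked

idx = select_batch(torch.randn(100, 32), torch.rand(100), budget=10)
```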
Latent Semantic and Disentangled Attention
IEEE Transactions on Pattern Analysis and Machine Intelligence. Pub Date: 2024-07-23. DOI: 10.1109/TPAMI.2024.3432631
Authors: Jen-Tzung Chien, Yu-Han Huang

Abstract: Sequential learning using the transformer has achieved state-of-the-art performance in natural language tasks and many others. The key to this success is the multi-head self-attention, which encodes and gathers features from individual tokens of an input sequence. The mapping or decoding is performed to produce an output sequence via cross attention. Such an attention framework has three weaknesses. First, since attention mixes up the features of different tokens in the input and output sequences, redundant information likely exists in the sequence data representation. Second, the patterns of attention weights among different heads tend to be similar, so the model capacity is bounded. Third, the robustness of an encoder-decoder network against model uncertainty is disregarded. To handle these weaknesses, this paper presents a Bayesian semantic and disentangled mask attention that learns latent disentanglement in multi-head attention, where the redundant features in the transformer are compensated with latent topic information. The attention weights are filtered by a mask that is optimized through semantic clustering. This attention mechanism is implemented according to Bayesian learning for clustered disentanglement. Experiments on machine translation and speech recognition show the merit of Bayesian clustered disentanglement for mask attention.

Citations: 0
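The core mechanism, filtering attention weights with a mask and renormalizing, can be sketched as follows; the per-head learnable gate here is an illustrative assumption, since the paper's mask is optimized through semantic clustering under Bayesian learning.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedAttention(nn.Module):
    def __init__(self, dim: int = 64, heads: int = 4, seq_len: int = 16):
        super().__init__()
        self.h, self.dk = heads, dim // heads
        self.qkv = nn.Linear(dim, 3 * dim)
        # one learnable mask logit per head and position pair (assumption)
        self.mask_logits = nn.Parameter(torch.zeros(heads, seq_len, seq_len))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(b, n, self.h, self.dk).transpose(1, 2)   # (b, h, n, dk)
        k = k.view(b, n, self.h, self.dk).transpose(1, 2)
        v = v.view(b, n, self.h, self.dk).transpose(1, 2)
        attn = F.softmax(q @ k.transpose(-2, -1) / self.dk**0.5, dim=-1)
        attn = attn * torch.sigmoid(self.mask_logits[:, :n, :n])       # filter weights
        attn = attn / attn.sum(dim=-1, keepdim=True).clamp_min(1e-8)   # renormalize
        return (attn @ v).transpose(1, 2).reshape(b, n, d)

y = MaskedAttention()(torch.randn(2, 16, 64))
```

Giving each head its own mask is what lets the heads specialize instead of converging to near-identical attention patterns.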
HDF-Net: Capturing Homogeny Difference Features to Localize the Tampered Image
IEEE Transactions on Pattern Analysis and Machine Intelligence. Pub Date: 2024-07-23. DOI: 10.1109/TPAMI.2024.3432551
Authors: Ruidong Han, Xiaofeng Wang, Ningning Bai, Yihang Wang, Jianpeng Hou, Jianru Xue

Abstract: Modern image editing software enables anyone to alter the content of an image to deceive the public, which can pose a security hazard to personal privacy and public safety. The detection and localization of image tampering is becoming an urgent issue to address. We have revealed that a tampered region exhibits homogeny differences (changes in the metadata organization form and organization structure of the image) from the real region after manipulations such as splicing, copy-move, and removal. Therefore, we propose a novel end-to-end network named HDF-Net to extract these homogeny difference features for precise localization of tampering artifacts. HDF-Net is composed of RGB and SRM dual-stream networks, including three complementary modules: the suspicious tampering-artifact prominent (STP) module, the fine tampering-artifact salient (FTS) module, and the tampering-artifact edge refined (TER) module. We utilize the fully attentional block (FLA) to enhance the characterization ability of the homogeny difference features extracted by each module and preserve the specifics of tampering artifacts. These modules are gradually merged according to a "coarse-fine-finer" strategy, which significantly improves localization accuracy and edge refinement. Extensive experiments demonstrate that HDF-Net performs better than state-of-the-art tampering localization models on five benchmarks, achieving satisfactory generalization and robustness. Code can be found at https://github.com/ruidonghan/HDF-Net/.

Citations: 0
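A sketch of the dual-stream input split: the RGB stream sees the image as-is, while the SRM stream sees high-pass noise residuals where tampering tends to leave traces. The single residual kernel below is one classic SRM filter chosen for illustration; the actual HDF-Net filter bank and its STP/FTS/TER modules are not reproduced here.

```python
import torch
import torch.nn.functional as F

# second-order high-pass residual kernel (one common SRM choice)
srm = torch.tensor([[-1., 2., -1.],
                    [ 2., -4., 2.],
                    [-1., 2., -1.]]) / 4.0
srm = srm.view(1, 1, 3, 3).repeat(3, 1, 1, 1)   # depthwise over RGB channels

def dual_stream_inputs(img: torch.Tensor):
    """img: (b, 3, h, w) in [0, 1]. Returns the RGB view and the noise-residual
    view consumed by the two streams."""
    residual = F.conv2d(img, srm, padding=1, groups=3)
    return img, residual

rgb, res = dual_stream_inputs(torch.rand(1, 3, 64, 64))
```

Because high-pass residuals suppress semantic content, the SRM stream is sensitive to the statistical seams that splicing, copy-move, and removal leave behind.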
Rethinking Self-Supervised Semantic Segmentation: Achieving End-to-End Segmentation
IEEE Transactions on Pattern Analysis and Machine Intelligence. Pub Date: 2024-07-23. DOI: 10.1109/TPAMI.2024.3432326
Authors: Yue Liu, Jun Zeng, Xingzhen Tao, Gang Fang

Abstract: The challenge of semantic segmentation with scarce pixel-level annotations has induced many self-supervised works, most of which essentially train an image encoder or a segmentation head that produces finer dense representations; when performing segmentation inference, they need to resort to supervised linear classifiers or traditional clustering. Segmentation by dataset-level clustering not only deviates from real-time, end-to-end inference practice, but also escalates the problem from segmenting each image to clustering all pixels at once, which results in downgraded performance. To remedy this issue, we propose a novel self-supervised semantic segmentation training and inference paradigm where inference is performed in an end-to-end manner. Specifically, based on our observations from probing the dense representations of an image-level self-supervised ViT, i.e., semantic inconsistency between patches and poor semantic quality in non-salient regions, we propose prototype-image alignment and global-local alignment with an attention-map constraint to train a tailored Transformer decoder with learnable prototypes, and we utilize adaptive prototypes for per-image segmentation inference. Extensive experiments under fully unsupervised semantic segmentation settings demonstrate the superior performance and generalizability of our proposed method. The code is available at: https://github.com/yliu1229/AlignSeg.

Citations: 0
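Per-image inference with learnable prototypes, the end-to-end property the abstract emphasizes, reduces to a nearest-prototype assignment; the feature extractor and prototype count below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def prototype_segment(pixel_feats: torch.Tensor, prototypes: torch.Tensor):
    """pixel_feats: (b, d, h, w) dense features; prototypes: (k, d) learnable.
    Returns per-pixel labels (b, h, w) via cosine similarity -- inference is a
    single forward pass per image, with no dataset-level clustering."""
    b, d, h, w = pixel_feats.shape
    f = F.normalize(pixel_feats.flatten(2).transpose(1, 2), dim=-1)  # (b, hw, d)
    p = F.normalize(prototypes, dim=-1)                              # (k, d)
    sim = f @ p.t()                                                  # (b, hw, k)
    return sim.argmax(dim=-1).view(b, h, w)

labels = prototype_segment(torch.randn(1, 64, 32, 32), torch.randn(8, 64))
```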
FEditNet++: Few-Shot Editing of Latent Semantics in GAN Spaces with Correlated Attribute Disentanglement
IEEE Transactions on Pattern Analysis and Machine Intelligence. Pub Date: 2024-07-23. DOI: 10.1109/TPAMI.2024.3432529
Authors: Ran Yi, Teng Hu, Mengfei Xia, Yizhe Tang, Yong-Jin Liu

Abstract: Generative adversarial networks have achieved significant advances in generating and editing high-resolution images. However, most methods suffer from either requiring extensive labeled datasets or strong prior knowledge. It is also challenging for them to disentangle correlated attributes with few-shot data. In this paper, we propose FEditNet++, a GAN-based approach to explore latent semantics. It aims to enable attribute editing with limited labeled data and to disentangle correlated attributes. We propose a layer-wise feature contrastive objective, which takes content consistency into consideration and facilitates the invariance of unrelated attributes before and after editing. Furthermore, we harness knowledge from the pretrained discriminative model to prevent overfitting. In particular, to solve the entanglement problem between correlated attributes arising from data and semantic latent correlation, we extend our model to jointly optimize multiple attributes, and we propose a novel decoupling loss and a cross-assessment loss to disentangle them in both the latent and image space. We further propose a novel-attribute disentanglement strategy to enable the editing of novel attributes with unknown entanglements. Finally, we extend our model to accurately edit fine-grained attributes. Qualitative and quantitative assessments demonstrate that our method outperforms state-of-the-art approaches across various datasets, including CelebA-HQ, RaFD, Danbooru2018, and LSUN Church.

Citations: 0
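One way a decoupling penalty between attribute directions could look, treating each attribute as a direction in latent space and penalizing pairwise alignment; this cosine-based form is an illustrative assumption, not the exact FEditNet++ loss.

```python
import torch
import torch.nn.functional as F

def edit(w: torch.Tensor, direction: torch.Tensor, alpha: float) -> torch.Tensor:
    """Move a latent code along a learned attribute direction."""
    return w + alpha * F.normalize(direction, dim=0)

def decoupling_loss(directions: torch.Tensor) -> torch.Tensor:
    """directions: (n_attr, d). Penalize pairwise alignment so that editing
    one attribute leaves the others untouched."""
    d = F.normalize(directions, dim=1)
    gram = d @ d.t()
    off_diag = gram - torch.eye(d.size(0))   # zero out self-similarity
    return off_diag.abs().mean()

dirs = torch.randn(3, 512, requires_grad=True)   # e.g., 3 attributes in a 512-d W space
loss = decoupling_loss(dirs)
loss.backward()
w_edited = edit(torch.randn(512), dirs.detach()[0], alpha=2.0)
```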
Towards a Flexible Semantic Guided Model for Single Image Enhancement and Restoration
IEEE Transactions on Pattern Analysis and Machine Intelligence. Pub Date: 2024-07-23. DOI: 10.1109/TPAMI.2024.3432308
Authors: Yuhui Wu, Guoqing Wang, Shaochong Liu, Yang Yang, Wei Li, Xiongxin Tang, Shuhang Gu, Chongyi Li, Heng Tao Shen

Abstract: Low-light image enhancement (LLIE) investigates how to improve the brightness of an image captured in illumination-insufficient environments. The majority of existing methods enhance low-light images in a global and uniform manner, without taking into account the semantic information of different regions. Consequently, a network may easily deviate from the original color of local regions. To address this issue, we propose a semantic-aware knowledge-guided framework (SKF) that can assist a low-light enhancement model in learning the rich and diverse priors encapsulated in a semantic segmentation model. We concentrate on incorporating semantic knowledge from three key aspects: a semantic-aware embedding module that adaptively integrates semantic priors in the feature representation space, a semantic-guided color histogram loss that preserves the color consistency of various instances, and a semantic-guided adversarial loss that produces more natural textures via semantic priors. Our SKF is appealing as a general framework for the LLIE task. We further present a refined framework, SKF++, with two new techniques: (a) an extra convolutional branch for intra-class illumination and color recovery through extracting local information, and (b) an equalization-based histogram transformation for contrast enhancement and high-dynamic-range adjustment. Extensive experiments on various LLIE benchmarks and other image processing tasks show that models equipped with SKF/SKF++ significantly outperform the baselines, and SKF/SKF++ generalizes well to different models and scenes. In addition, the potential benefits of our method for face detection and semantic segmentation in low-light conditions are discussed. The code and pre-trained models are publicly available at https://github.com/langmanbusi/Semantic-Aware-Low-Light-Image-Enhancement.

Citations: 0
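One plausible reading of the semantic-aware embedding module is an affine (SFT-style) modulation of enhancement features by segmentation priors; the sketch below follows that reading under invented layer sizes, and is not the released SKF code.

```python
import torch
import torch.nn as nn

class SemanticModulation(nn.Module):
    def __init__(self, feat_ch: int = 32, sem_ch: int = 19):
        super().__init__()
        # predict a per-pixel scale and shift from the segmentation prior
        self.to_gamma = nn.Conv2d(sem_ch, feat_ch, 3, padding=1)
        self.to_beta = nn.Conv2d(sem_ch, feat_ch, 3, padding=1)

    def forward(self, feats: torch.Tensor, sem_prior: torch.Tensor):
        """feats: (b, feat_ch, h, w) enhancement features; sem_prior:
        (b, sem_ch, h, w), e.g. class probabilities from a frozen
        segmentation model."""
        return feats * (1 + self.to_gamma(sem_prior)) + self.to_beta(sem_prior)

mod = SemanticModulation()
out = mod(torch.randn(1, 32, 64, 64),
          torch.softmax(torch.randn(1, 19, 64, 64), dim=1))
```

Conditioning the scale and shift on per-pixel class evidence is what lets different semantic regions receive different enhancement, rather than one global curve.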
Unpaired Image-Text Matching via Multimodal Aligned Conceptual Knowledge
IEEE Transactions on Pattern Analysis and Machine Intelligence. Pub Date: 2024-07-23. DOI: 10.1109/TPAMI.2024.3432552
Authors: Yan Huang, Yuming Wang, Yunan Zeng, Junshi Huang, Zhenhua Chai, Liang Wang

Abstract: Recently, the accuracy of image-text matching has been greatly improved by multimodal pretrained models, all of which use millions or billions of paired images and texts for supervised model learning. Unlike these models, the human brain can readily match images with texts using stored multimodal knowledge. Inspired by this, we study a new scenario, unpaired image-text matching, in which paired images and texts are assumed to be unavailable during model learning. To deal with it, we propose a simple yet effective method named Multimodal Aligned Conceptual Knowledge (MACK). First, we collect a set of words and their related image regions from publicly available datasets, and compute prototypical region representations to obtain pretrained general knowledge. To make the obtained knowledge better suit certain datasets, we refine it using unpaired images and texts in a self-supervised manner to obtain fine-tuned domain knowledge. Then, to match given images with texts based on this knowledge, we represent the parsed words in the texts by prototypical region representations and compute region-word similarity scores. Finally, the scores are aggregated by bidirectional similarity pooling into an image-text similarity score, which can be directly used for unpaired image-text matching. The proposed MACK is complementary to existing models and can be easily extended as a re-ranking method to substantially improve their performance on zero-shot and cross-dataset image-text matching.

Citations: 0
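The matching recipe in the abstract, region-word similarity scores aggregated by bidirectional similarity pooling, is easy to sketch; the max-then-mean aggregation below is an illustrative assumption for the pooling step.

```python
import torch
import torch.nn.functional as F

def mack_score(regions: torch.Tensor, word_protos: torch.Tensor) -> torch.Tensor:
    """regions: (n_regions, d) image region features; word_protos: (n_words, d)
    prototypical region representations of the parsed words."""
    r = F.normalize(regions, dim=1)
    w = F.normalize(word_protos, dim=1)
    sim = r @ w.t()                              # (n_regions, n_words)
    txt_to_img = sim.max(dim=0).values.mean()    # best region for each word
    img_to_txt = sim.max(dim=1).values.mean()    # best word for each region
    return 0.5 * (txt_to_img + img_to_txt)       # bidirectional pooling

score = mack_score(torch.randn(36, 256), torch.randn(7, 256))
```

Because both directions are pooled, an image is penalized for regions no word explains and a text for words no region supports, which makes the score usable directly, or as a re-ranker on top of an existing model.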