International Journal of Computer Vision: Latest Articles

Hard-Normal Example-Aware Template Mutual Matching for Industrial Anomaly Detection
IF 19.5 | CAS Q2 (Computer Science)
International Journal of Computer Vision | Pub Date: 2024-12-18 | DOI: 10.1007/s11263-024-02323-0
Zixuan Chen, Xiaohua Xie, Lingxiao Yang, Jian-Huang Lai
Abstract: Anomaly detectors are widely used in industrial manufacturing to detect and localize unknown defects in query images. These detectors are trained on anomaly-free samples and successfully distinguish anomalies from most normal samples. However, hard-normal examples are scattered far from the bulk of normal samples, so existing methods often mistake them for anomalies. To address this issue, we propose Hard-normal Example-aware Template Mutual Matching (HETMM), an efficient framework for building a robust prototype-based decision boundary. Specifically, HETMM employs the proposed Affine-invariant Template Mutual Matching (ATMM) to mitigate the effects of affine transformations and easy-normal examples. By mutually matching pixel-level prototypes within patch-level search spaces between the query and the template set, ATMM can accurately distinguish hard-normal examples from anomalies, achieving low false-positive and missed-detection rates. In addition, we propose PTS to compress the original template set for speed-up: PTS selects cluster centres and hard-normal examples to preserve the original decision boundary, allowing this tiny set to achieve performance comparable to the original one. Extensive experiments demonstrate that HETMM outperforms state-of-the-art methods, while a tiny 60-sheet template set achieves competitive performance and real-time inference speed (around 26.1 FPS) on a Quadro RTX 8000 GPU. HETMM is training-free and can be hot-updated by directly inserting novel samples into the template set, which can promptly address some incremental-learning issues in industrial manufacturing.
Citations: 0
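The core idea of a prototype-based decision boundary can be sketched with a toy nearest-prototype scorer. This is a deliberate simplification: the paper's ATMM additionally performs mutual, affine-invariant matching within patch-level search windows, which is not reproduced here.

```python
import numpy as np

def anomaly_scores(query_feats, template_feats):
    """Score each query feature by its distance to the nearest
    template prototype (a simplified stand-in for template matching)."""
    # Pairwise Euclidean distances: shape (n_query, n_template)
    d = np.linalg.norm(query_feats[:, None, :] - template_feats[None, :, :], axis=-1)
    return d.min(axis=1)  # nearest-prototype distance per query feature

# Toy example: template features cluster near the origin;
# one query feature is placed far away to act as an anomaly.
rng = np.random.default_rng(0)
templates = rng.normal(0.0, 0.1, size=(50, 8))
queries = np.vstack([rng.normal(0.0, 0.1, size=(4, 8)),
                     np.full((1, 8), 5.0)])  # injected anomaly (row 4)
scores = anomaly_scores(queries, templates)
print(scores.argmax())  # the anomalous row receives the highest score
```

Hot-updating, as described in the abstract, then amounts to appending new rows to `templates` with no retraining.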
Beyond Talking – Generating Holistic 3D Human Dyadic Motion for Communication
IF 19.5 | CAS Q2 (Computer Science)
International Journal of Computer Vision | Pub Date: 2024-12-17 | DOI: 10.1007/s11263-024-02300-7
Mingze Sun, Chao Xu, Xinyu Jiang, Yang Liu, Baigui Sun, Ruqi Huang
Abstract: In this paper, we introduce an innovative task focused on human communication: generating holistic 3D human motions for both speakers and listeners. Central to our approach is the use of factorization to decouple audio features, combined with textual semantic information, which facilitates the creation of more realistic and coordinated movements. We train separate VQ-VAEs for the holistic motions of the speaker and the listener. We account for the real-time mutual influence between speaker and listener and propose a novel chain-like, transformer-based auto-regressive model, specifically designed to characterize real-world communication scenarios, that generates the motions of both the speaker and the listener simultaneously. These designs ensure that the generated results are both coordinated and diverse. Our approach demonstrates state-of-the-art performance on two benchmark datasets. Furthermore, we introduce the HoCo holistic communication dataset, a valuable resource for future research. Our HoCo dataset and code will be released for research purposes upon acceptance.
Citations: 0
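The VQ-VAE bottleneck mentioned above discretizes continuous motion latents by snapping each vector to its nearest codebook entry; the autoregressive transformer then models sequences of the resulting code indices. A minimal sketch of the quantization step (omitting the encoder, decoder, and straight-through gradient):

```python
import numpy as np

def vector_quantize(z, codebook):
    """Map each continuous latent vector to its nearest codebook entry,
    as in a VQ-VAE bottleneck (simplified: no straight-through gradient)."""
    # Distances from every latent to every code: shape (n_latents, n_codes)
    d = np.linalg.norm(z[:, None, :] - codebook[None, :, :], axis=-1)
    idx = d.argmin(axis=1)         # discrete code index per latent
    return codebook[idx], idx      # quantized latents and their indices

codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]])
z = np.array([[0.1, -0.05], [0.9, 1.2]])
zq, idx = vector_quantize(z, codebook)
print(idx.tolist())  # prints [0, 1]: each latent snaps to its nearest code
```

The index sequences produced this way are what a chain-like autoregressive model, as described in the abstract, would predict token by token for the speaker and listener streams.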
Hyper-3DG: Text-to-3D Gaussian Generation via Hypergraph
IF 19.5 | CAS Q2 (Computer Science)
International Journal of Computer Vision | Pub Date: 2024-12-16 | DOI: 10.1007/s11263-024-02298-y
Donglin Di, Jiahui Yang, Chaofan Luo, Zhou Xue, Wei Chen, Xun Yang, Yue Gao
Abstract: Text-to-3D generation is a rapidly advancing field that transforms textual descriptions into detailed 3D models. However, current methods often neglect the intricate high-order correlations between geometry and texture within 3D objects, leading to problems such as over-smoothing, over-saturation, and the Janus problem. In this work, we propose "3D Gaussian Generation via Hypergraph (Hyper-3DG)", designed to capture the sophisticated high-order correlations present within 3D objects. Our framework is anchored by a main optimization flow and an essential module named "Geometry and Texture Hypergraph Refiner (HGRefiner)". This module not only refines the representation of 3D Gaussians but also accelerates their update process by conducting Patch-3DGS hypergraph learning on both explicit attributes and latent visual features. Our framework produces finely generated 3D objects within a cohesive optimization, effectively circumventing degradation. Extensive experiments show that the proposed method significantly enhances the quality of 3D generation while incurring no additional computational overhead for the underlying framework. (Project code: https://github.com/yjhboy/Hyper3DG)
Citations: 0
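Hypergraph learning, unlike an ordinary graph, lets one hyperedge connect many patches at once, so high-order correlations can be smoothed jointly. A minimal sketch of one feature-propagation step over a hypergraph incidence matrix (an illustration of the general mechanism, not the paper's HGRefiner):

```python
import numpy as np

def hypergraph_smooth(X, H):
    """One step of feature smoothing over a hypergraph with incidence
    matrix H (nodes x hyperedges): average features within each
    hyperedge, then average back over the hyperedges of each node."""
    De = H.sum(axis=0)                      # hyperedge degrees
    Dv = H.sum(axis=1)                      # node degrees
    edge_means = (H.T @ X) / De[:, None]    # mean feature per hyperedge
    return (H @ edge_means) / Dv[:, None]   # redistribute means to nodes

# Two hyperedges over four nodes: {0, 1, 2} and {2, 3}
H = np.array([[1, 0], [1, 0], [1, 1], [0, 1]], dtype=float)
X = np.array([[0.0], [3.0], [6.0], [9.0]])
smoothed = hypergraph_smooth(X, H)
print(smoothed)  # node 2 blends both hyperedge means: (3 + 7.5) / 2
```

Node 2 belongs to both hyperedges, so its updated feature mixes information from all four nodes in one step, which is the kind of high-order coupling an ordinary pairwise graph would need multiple hops to achieve.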
Relation-Guided Adversarial Learning for Data-Free Knowledge Transfer
IF 19.5 | CAS Q2 (Computer Science)
International Journal of Computer Vision | Pub Date: 2024-12-13 | DOI: 10.1007/s11263-024-02303-4
Yingping Liang, Ying Fu
Abstract: Data-free knowledge distillation transfers knowledge by recovering training data from a pre-trained model. Despite recent success in seeking global data diversity, the diversity within each class and the similarity among different classes are largely overlooked, resulting in data homogeneity and limited performance. In this paper, we introduce Relation-Guided Adversarial Learning (RGAL), a novel method with triplet losses that tackles the homogeneity problem from two directions: promoting both intra-class diversity and inter-class confusion of the generated samples. To this end, we design two phases: an image-synthesis phase and a student-training phase. In the image-synthesis phase, an optimization process pushes apart samples with the same label and pulls together samples with different labels, yielding intra-class diversity and inter-class confusion, respectively. In the student-training phase, we perform the opposite optimization, adversarially reducing the distance between samples of the same class and enlarging the distance between samples of different classes. To mitigate the conflict between seeking high global diversity and maintaining inter-class confusion, we propose a focal weighted sampling strategy that selects triplet negatives unevenly within a finite distance range. RGAL shows significant improvements over previous state-of-the-art methods in accuracy and data efficiency, and it can be inserted into state-of-the-art methods for various data-free knowledge-transfer applications. Experiments on various benchmarks demonstrate the effectiveness and generalizability of the proposed method on several tasks, especially data-free knowledge distillation, data-free quantization, and non-exemplar incremental learning. Our code will be publicly available to the community.
Citations: 0
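Both phases above are built on the standard triplet loss; the paper's twist is to flip which pairs are pulled together versus pushed apart between the synthesis and student-training phases. A minimal sketch of the underlying loss:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet loss: pull the positive toward the anchor and
    push the negative at least `margin` farther away than the positive."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

a = np.array([0.0, 0.0])
p = np.array([0.1, 0.0])   # close sample
n = np.array([3.0, 0.0])   # far sample
print(triplet_loss(a, p, n))  # margin already satisfied: loss is 0.0
print(triplet_loss(a, n, p))  # roles swapped: positive loss, gradient flows
```

Swapping the roles of positive and negative, as in the second call, is exactly the kind of sign flip the abstract describes: the synthesis phase pushes same-label samples apart while the student-training phase pulls them back together.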
MaskDiffusion: Boosting Text-to-Image Consistency with Conditional Mask
IF 19.5 | CAS Q2 (Computer Science)
International Journal of Computer Vision | Pub Date: 2024-12-12 | DOI: 10.1007/s11263-024-02294-2
Yupeng Zhou, Daquan Zhou, Yaxing Wang, Jiashi Feng, Qibin Hou
Abstract: Recent advances in diffusion models have showcased their impressive capacity to generate visually striking images. However, ensuring a close match between the generated image and the given prompt remains a persistent challenge. In this work, we identify inadequate cross-modality relation learning between the prompt and the generated image as a crucial factor behind the erroneous generation of objects and their attributes. To better align the prompt and image content, we augment cross-attention with an adaptive mask, conditioned on the attention maps and the prompt embeddings, that dynamically adjusts each text token's contribution to the image features. This mechanism explicitly reduces ambiguity in the text encoder's semantic embedding, boosting text-to-image consistency in the synthesized images. Our method, termed MaskDiffusion, is training-free and hot-pluggable for popular pre-trained diffusion models. Applied to latent diffusion models, MaskDiffusion substantially improves their ability to correctly generate objects and their attributes, with negligible computational overhead compared to the original models. Our project page is https://github.com/HVision-NKU/MaskDiffusion.
Citations: 0
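The mechanism can be illustrated with a toy masked cross-attention step. In the sketch below the mask is a fixed boolean matrix; in MaskDiffusion itself the mask is adaptive, computed from the attention maps and prompt embeddings, which is not modeled here.

```python
import numpy as np

def masked_cross_attention(Q, K, V, mask):
    """Cross-attention in which a (query x token) boolean mask removes
    selected text tokens' contributions before softmax renormalisation."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = np.where(mask, scores, -1e9)   # suppress masked-out tokens
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)   # rows renormalise to 1
    return w @ V

# One image-feature query attending over two text tokens with equal
# raw scores; the mask zeroes out the second token's contribution.
Q = np.ones((1, 4))
K = np.ones((2, 4))
V = np.array([[1.0, 0.0], [0.0, 1.0]])
mask = np.array([[True, False]])
out = masked_cross_attention(Q, K, V, mask)
print(out)  # output equals V[0]: only the unmasked token contributes
```

Because the attention weights are renormalised after masking, the surviving tokens' influence is redistributed rather than simply attenuated, which is what sharpens the token-to-feature assignment.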
MoDA: Modeling Deformable 3D Objects from Casual Videos
IF 19.5 | CAS Q2 (Computer Science)
International Journal of Computer Vision | Pub Date: 2024-12-12 | DOI: 10.1007/s11263-024-02310-5
Chaoyue Song, Jiacheng Wei, Tianyi Chen, Yiwen Chen, Chuan-Sheng Foo, Fayao Liu, Guosheng Lin
Abstract: In this paper, we address the challenges of modeling deformable 3D objects from casual videos. With the popularity of NeRF, many works extend it to dynamic scenes using a canonical NeRF plus a deformation model that maps 3D points between the observation space and the canonical space. Recent works rely on linear blend skinning (LBS) for the canonical-observation transformation. However, a linearly weighted combination of rigid transformation matrices is not guaranteed to be rigid; in fact, unexpected scale and shear factors often appear. In practice, using LBS as the deformation model can lead to skin-collapsing artifacts under bending or twisting motions. To solve this problem, we propose neural dual quaternion blend skinning (NeuDBS) for 3D point deformation, which performs rigid transformations without skin-collapsing artifacts. To register 2D pixels across different frames, we establish a correspondence between canonical feature embeddings, which encode 3D points within the canonical space, and 2D image features by solving an optimal transport problem. In addition, we introduce a texture-filtering approach for texture rendering that effectively minimizes the impact of noisy colors outside the target deformable objects.
Citations: 0
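The failure mode of LBS that motivates dual quaternion blend skinning is easy to demonstrate numerically: averaging two rotation matrices is generally not a rotation. In the extreme case below, blending a 0° and a 180° rotation collapses to the zero matrix, which is precisely the "skin-collapsing" scale degeneracy; DQB avoids this by blending unit dual quaternions and renormalising, so the result stays rigid.

```python
import numpy as np

def rot2d(theta):
    """2D rotation matrix for angle theta (radians)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

# LBS-style blend: a direct weighted average of rotation matrices.
# A rigid rotation has determinant exactly 1; this blend does not.
R = 0.5 * rot2d(0.0) + 0.5 * rot2d(np.pi)   # blend of 0 and 180 degrees
print(np.linalg.det(R))  # collapses toward 0 instead of staying at 1
```

A skinned vertex multiplied by this blended matrix is crushed toward the joint, which is exactly the artifact the abstract describes for bending and twisting motions.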
Structured Generative Models for Scene Understanding
IF 19.5 | CAS Q2 (Computer Science)
International Journal of Computer Vision | Pub Date: 2024-12-12 | DOI: 10.1007/s11263-024-02316-z
Christopher K. I. Williams
Abstract: This position paper argues for the use of structured generative models (SGMs) for understanding static scenes. This requires reconstructing a 3D scene from an input image (or a set of multi-view images), whereby the contents of the image(s) are causally explained in terms of models of instantiated objects, each with its own type, shape, appearance, and pose, along with global variables such as scene lighting and camera parameters. The approach also requires scene models that account for the co-occurrences and inter-relationships of objects in a scene. The SGM approach has the merits of being compositional and generative, which lead to interpretability and editability. To pursue the SGM agenda, we need models for objects and scenes, and approaches to carry out inference. We first review models for objects, including "things" (object categories with a well-defined shape) and "stuff" (categories with amorphous spatial extent). We then review scene models that describe the inter-relationships of objects. Perhaps the most challenging problem for SGMs is inferring the objects, lighting, camera parameters, and scene inter-relationships from a single image or multiple images. We conclude with a discussion of issues that must be addressed to advance the SGM agenda.
Citations: 0
InfoPro: Locally Supervised Deep Learning by Maximizing Information Propagation
IF 19.5 | CAS Q2 (Computer Science)
International Journal of Computer Vision | Pub Date: 2024-12-11 | DOI: 10.1007/s11263-024-02296-0
Yulin Wang, Zanlin Ni, Yifan Pu, Cai Zhou, Jixuan Ying, Shiji Song, Gao Huang
Abstract: End-to-end (E2E) training has become the de facto standard for training modern deep networks, e.g., ConvNets and vision Transformers (ViTs). Typically, a global error signal is generated at the end of the model and back-propagated layer by layer to update the parameters. This paper shows that relying on back-propagated global errors may not be necessary for deep learning. More precisely, deep networks with competitive or even better performance can be obtained purely through locally supervised learning, i.e., splitting a network into gradient-isolated modules and training them with local supervision signals. However, this extension is non-trivial. Our experimental and theoretical analysis demonstrates that simply training local modules with an E2E objective tends to be short-sighted, collapsing task-relevant information at early layers and hurting the performance of the full model. To avoid this issue, we propose an information propagation (InfoPro) loss, which encourages local modules to preserve as much useful information as possible while progressively discarding task-irrelevant information. Because the InfoPro loss is difficult to compute in its original form, we derive a feasible upper bound as a surrogate optimization objective, yielding a simple but effective algorithm. We evaluate InfoPro extensively with ConvNets and ViTs on twelve computer-vision benchmarks organized into five tasks (image/video recognition, semantic/instance segmentation, and object detection). InfoPro exhibits superior efficiency over E2E training in terms of GPU memory footprint, convergence speed, and training-data scale. Moreover, InfoPro enables the effective training of more parameter- and computation-efficient models (e.g., much deeper networks) that suffer inferior performance when trained end-to-end. Code: https://github.com/blackfeather-wang/InfoPro-Pytorch.
Citations: 0
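"Gradient-isolated modules" means each module is updated only by its own local loss, with no gradient crossing module boundaries. The toy regression below illustrates just that isolation mechanism with two one-parameter linear modules and hand-written gradients; the local target for module 1 is an arbitrary illustrative choice, and nothing here implements the InfoPro loss itself.

```python
import numpy as np

# Two gradient-isolated linear modules trained with purely local losses.
# Module 2 treats module 1's output as a constant input ("detached"),
# so no error signal ever propagates back across the boundary.
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 1))
y = 3.0 * x                          # overall regression target

w1, w2 = 0.0, 0.0
for _ in range(200):
    h = w1 * x                       # module 1 forward
    # Local loss for module 1: match an intermediate target (here 1.5x,
    # a stand-in for whatever local supervision signal is used).
    w1 -= 0.1 * np.mean(2 * (h - 1.5 * x) * x)
    h_detached = h                   # constant w.r.t. module 2's update
    out = w2 * h_detached            # module 2 forward
    # Local loss for module 2: match the final target.
    w2 -= 0.1 * np.mean(2 * (out - y) * h_detached)

print(round(w1, 3), round(w2, 3))   # each module solves its own problem
```

The composition w2 * w1 still recovers the overall mapping (3.0) even though neither update ever saw a global gradient, which is the basic premise that the paper's analysis then refines with information-theoretic local objectives.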
CMAE-3D: Contrastive Masked AutoEncoders for Self-Supervised 3D Object Detection
IF 19.5 | CAS Q2 (Computer Science)
International Journal of Computer Vision | Pub Date: 2024-12-11 | DOI: 10.1007/s11263-024-02313-2
Yanan Zhang, Jiaxin Chen, Di Huang
Abstract: LiDAR-based 3D object detection is a crucial task for autonomous driving, owing to its accurate recognition and localization of objects in real-world 3D space. However, existing methods rely heavily on time-consuming, labor-intensive, large-scale labeled LiDAR data, a bottleneck for both performance improvement and practical application. In this paper, we propose Contrastive Masked AutoEncoders for self-supervised 3D object detection, dubbed CMAE-3D, a promising solution for effectively alleviating label dependency in 3D perception. Specifically, we integrate contrastive learning (CL) and masked autoencoders (MAE) into one unified framework to fully exploit the complementary characteristics of global semantic representation and local spatial perception. From the MAE perspective, we develop Geometric-Semantic Hybrid Masking (GSHM) to selectively mask representative regions in point clouds with imbalanced foreground-background and uneven density distributions, and we design Multi-scale Latent Feature Reconstruction (MLFR) to capture high-level semantic features while mitigating the redundant reconstruction of low-level details. From the CL perspective, we present Hierarchical Relational Contrastive Learning (HRCL) to mine rich semantic-similarity information while alleviating negative-sample mismatch at both the voxel and frame levels. Extensive experiments demonstrate the effectiveness of our pre-training method when applied to multiple mainstream 3D object detectors (SECOND, CenterPoint, and PV-RCNN) on three popular datasets (KITTI, Waymo, and nuScenes).
Citations: 0
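The masking step in an MAE-style point-cloud pre-training pipeline can be sketched as voxelizing the cloud and hiding a fraction of the voxels; the encoder then sees only the visible points and the decoder must reconstruct the rest. The sketch below masks voxels uniformly at random, whereas the paper's GSHM selects regions by geometric and semantic criteria, which is not modeled here.

```python
import numpy as np

def random_voxel_mask(points, voxel_size=1.0, mask_ratio=0.5, seed=0):
    """Group points into voxels and hide a fraction of the voxels:
    a simplified, uniform stand-in for region-level masking."""
    rng = np.random.default_rng(seed)
    vox = np.floor(points / voxel_size).astype(int)      # voxel coords
    uniq, inv = np.unique(vox, axis=0, return_inverse=True)
    n_mask = int(len(uniq) * mask_ratio)
    masked_ids = rng.choice(len(uniq), size=n_mask, replace=False)
    keep = ~np.isin(inv, masked_ids)
    return points[keep], points[~keep]    # visible points, hidden points

pts = np.array([[0.1, 0.2, 0.0], [0.3, 0.1, 0.2],   # voxel (0, 0, 0)
                [1.5, 0.0, 0.0],                     # voxel (1, 0, 0)
                [0.0, 1.7, 0.0]])                    # voxel (0, 1, 0)
visible, hidden = random_voxel_mask(pts, mask_ratio=0.34)
print(len(visible), len(hidden))  # one of the three voxels is hidden
```

Masking whole voxels rather than individual points forces the model to reason about local structure, since an entire neighborhood disappears at once.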
Language-Guided Hierarchical Fine-Grained Image Forgery Detection and Localization
IF 19.5 | CAS Q2 (Computer Science)
International Journal of Computer Vision | Pub Date: 2024-12-10 | DOI: 10.1007/s11263-024-02255-9
Xiao Guo, Xiaohong Liu, Iacopo Masi, Xiaoming Liu
Abstract: Forgery attributes differ greatly between images produced in the CNN-synthesis and image-editing domains, and these differences make unified image forgery detection and localization (IFDL) challenging. To this end, we present a hierarchical fine-grained formulation for IFDL representation learning. Specifically, we first represent the forgery attributes of a manipulated image with multiple labels at different levels, then perform fine-grained classification at these levels using the hierarchical dependency between them. The algorithm is thereby encouraged to learn both comprehensive features and the inherent hierarchical nature of different forgery attributes, improving the IFDL representation. In this work, we propose a Language-guided Hierarchical Fine-grained IFDL method, denoted HiFi-Net++, with four components: a multi-branch feature extractor, a language-guided forgery localization enhancer, and classification and localization modules. Each branch of the multi-branch feature extractor learns to classify forgery attributes at one level, while the localization and classification modules segment pixel-level forgery regions and detect image-level forgery, respectively. In addition, the language-guided forgery localization enhancer (LFLE), containing image and text encoders learned via contrastive language-image pre-training (CLIP), is used to further enrich the IFDL representation. LFLE takes specifically designed texts and the given image as multi-modal inputs and generates visual embeddings and manipulation score maps, which further improve the manipulation-localization performance of HiFi-Net++. Lastly, we construct a hierarchical fine-grained dataset to facilitate our study. We demonstrate the effectiveness of our method on 8 different benchmarks for both IFDL and forgery-attribute classification. Our source code and dataset can be found at: github.com/CHELSEA234/HiFi-IFDL.
Citations: 0
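One simple way to exploit a hierarchical dependency between label levels is to gate each fine-grained class's probability by the probability of its parent coarse class. This sketch illustrates that general idea only; the class names and the gating-by-multiplication scheme are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

def hierarchical_probs(coarse_logits, fine_logits, parent):
    """Combine coarse- and fine-level classifier outputs so each fine
    class's probability is gated by its parent coarse class."""
    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()
    pc = softmax(coarse_logits)
    pf = softmax(fine_logits)
    joint = pf * pc[parent]        # gate fine probs by parent prob
    return joint / joint.sum()     # renormalise to a distribution

# Two coarse classes (e.g. "CNN-synthesized" vs "image-edited") and
# four fine classes mapped to parents [0, 0, 1, 1].
parent = np.array([0, 0, 1, 1])
coarse = np.array([2.0, -2.0])          # coarse head: strongly class 0
fine = np.array([1.0, 0.0, 0.0, 3.0])   # fine head alone prefers class 3
p = hierarchical_probs(coarse, fine, parent)
print(p.argmax())  # the hierarchy overrides the fine head's favorite
```

Although the fine head's raw favorite (class 3) belongs to coarse class 1, the confident coarse prediction suppresses that branch, so the combined prediction lands inside coarse class 0. This consistency pressure is what the hierarchical formulation buys.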