{"title":"PICK: Predict and Mask for Semi-supervised Medical Image Segmentation","authors":"Qingjie Zeng, Zilin Lu, Yutong Xie, Yong Xia","doi":"10.1007/s11263-024-02328-9","DOIUrl":"https://doi.org/10.1007/s11263-024-02328-9","url":null,"abstract":"<p>Pseudo-labeling and consistency-based co-training are established paradigms in semi-supervised learning. Pseudo-labeling focuses on selecting reliable pseudo-labels, while co-training emphasizes sub-network diversity for complementary information extraction. However, both paradigms struggle with the inevitable erroneous predictions from unlabeled data, which poses a risk to task-specific decoders and ultimately impact model performance. To address this challenge, we propose a PredICt-and-masK (PICK) model for semi-supervised medical image segmentation. PICK operates by masking and predicting pseudo-label-guided attentive regions to exploit unlabeled data. It features a shared encoder and three task-specific decoders. Specifically, PICK employs a primary decoder supervised solely by labeled data to generate pseudo-labels, identifying potential targets in unlabeled data. The model then masks these regions and reconstructs them using a masked image modeling (MIM) decoder, optimizing through a reconstruction task. To reconcile segmentation and reconstruction, an auxiliary decoder is further developed to learn from the reconstructed images, whose predictions are constrained by the primary decoder. We evaluate PICK on five medical benchmarks, including single organ/tumor segmentation, multi-organ segmentation, and domain-generalized tasks. Our results indicate that PICK outperforms state-of-the-art methods. The code is available at https://github.com/maxwell0027/PICK.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"27 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-01-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142929487","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"General Class-Balanced Multicentric Dynamic Prototype Pseudo-Labeling for Source-Free Domain Adaptation","authors":"Sanqing Qu, Guang Chen, Jing Zhang, Zhijun Li, Wei He, Dacheng Tao","doi":"10.1007/s11263-024-02335-w","DOIUrl":"https://doi.org/10.1007/s11263-024-02335-w","url":null,"abstract":"<p>Source-free Domain Adaptation aims to adapt a pre-trained source model to an unlabeled target domain while circumventing access to well-labeled source data. To compensate for the absence of source data, most existing approaches employ prototype-based pseudo-labeling strategies to facilitate self-training model adaptation. Nevertheless, these methods commonly rely on instance-level predictions for direct monocentric prototype construction, leading to category bias and noisy labels. This is primarily due to the inherent visual domain gaps that often differ across categories. Besides, the monocentric prototype design is ineffective and may introduce negative transfer for those ambiguous data. To tackle these challenges, we propose a general class-<b>B</b>alanced <b>M</b>ulticentric <b>D</b>ynamic (BMD) prototype strategy. Specifically, we first introduce a global inter-class balanced sampling strategy for each target category to mitigate category bias. Subsequently, we design an intra-class multicentric clustering strategy to generate robust and representative prototypes. In contrast to existing approaches that only update pseudo-labels at fixed intervals, e.g., one epoch, we employ a dynamic pseudo-labeling strategy that incorporates network update information throughout the model adaptation. We refer to the vanilla implementation of these three sub-strategies as BMD-v1. Furthermore, we promote the BMD-v1 to BMD-v2 by incorporating a consistency-guided reweighting strategy to improve inter-class balanced sampling, and leveraging the silhouettes metric to realize adaptive intra-class multicentric clustering. Extensive experiments conducted on both 2D images and 3D point cloud recognition demonstrate that our proposed BMD strategy significantly improves existing representative methods. Remarkably, BMD-v2 improves NRC from 52.6 to 59.2% in accuracy on the PointDA-10 benchmark. The code will be available at https://github.com/ispc-lab/BMD.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"159 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-01-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142925097","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HUPE: Heuristic Underwater Perceptual Enhancement with Semantic Collaborative Learning","authors":"Zengxi Zhang, Zhiying Jiang, Long Ma, Jinyuan Liu, Xin Fan, Risheng Liu","doi":"10.1007/s11263-024-02318-x","DOIUrl":"https://doi.org/10.1007/s11263-024-02318-x","url":null,"abstract":"<p>Underwater images are often affected by light refraction and absorption, reducing visibility and interfering with subsequent applications. Existing underwater image enhancement methods primarily focus on improving visual quality while overlooking practical implications. To strike a balance between visual quality and application, we propose a heuristic invertible network for underwater perception enhancement, dubbed HUPE, which enhances visual quality and demonstrates flexibility in handling other downstream tasks. Specifically, we introduced a information-preserving reversible transformation with embedded Fourier transform to establish a bidirectional mapping between underwater images and their clear images. Additionally, a heuristic prior is incorporated into the enhancement process to better capture scene information. To further bridges the feature gap between vision-based enhancement images and application-oriented images, a semantic collaborative learning module is applied in the joint optimization process of the visual enhancement task and the downstream task, which guides the proposed enhancement model to extract more task-oriented semantic features while obtaining visually pleasing images. Extensive experiments, both quantitative and qualitative, demonstrate the superiority of our HUPE over state-of-the-art methods. The source code is available at https://github.com/ZengxiZhang/HUPE.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"20 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-01-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142925119","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Robust Sequential DeepFake Detection","authors":"Rui Shao, Tianxing Wu, Ziwei Liu","doi":"10.1007/s11263-024-02339-6","DOIUrl":"https://doi.org/10.1007/s11263-024-02339-6","url":null,"abstract":"<p>Since photorealistic faces can be readily generated by facial manipulation technologies nowadays, potential malicious abuse of these technologies has drawn great concerns. Numerous deepfake detection methods are thus proposed. However, existing methods only focus on detecting <i>one-step</i> facial manipulation. As the emergence of easy-accessible facial editing applications, people can easily manipulate facial components using <i>multi-step</i> operations in a sequential manner. This new threat requires us to detect a sequence of facial manipulations, which is vital for both detecting deepfake media and recovering original faces afterwards. Motivated by this observation, we emphasize the need and propose a novel research problem called Detecting Sequential DeepFake Manipulation (Seq-DeepFake). Unlike the existing deepfake detection task only demanding a binary label prediction, detecting Seq-DeepFake manipulation requires correctly predicting a sequential vector of facial manipulation operations. To support a large-scale investigation, we construct the first Seq-DeepFake dataset, where face images are manipulated sequentially with corresponding annotations of sequential facial manipulation vectors. Based on this new dataset, we cast detecting Seq-DeepFake manipulation as a specific image-to-sequence (e.g., image captioning) task and propose a concise yet effective Seq-DeepFake Transformer (SeqFakeFormer). To better reflect real-world deepfake data distributions, we further apply various perturbations on the original Seq-DeepFake dataset and construct the more challenging Sequential DeepFake dataset with perturbations (Seq-DeepFake-P). To exploit deeper correlation between images and sequences when facing Seq-DeepFake-P, a dedicated Seq-DeepFake Transformer with Image-Sequence Reasoning (SeqFakeFormer++) is devised, which builds stronger correspondence between image-sequence pairs for more robust Seq-DeepFake detection. Moreover, we build a comprehensive benchmark and set up rigorous evaluation protocols and metrics for this new research problem. Extensive quantitative and qualitative experiments demonstrate the effectiveness of SeqFakeFormer and SeqFakeFormer++. Several valuable observations are also revealed to facilitate future research in broader deepfake detection problems. The code has been released at https://github.com/rshaojimmy/SeqDeepFake/.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"388 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-01-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142924999","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Blind Image Quality Assessment: Exploring Content Fidelity Perceptibility via Quality Adversarial Learning","authors":"Mingliang Zhou, Wenhao Shen, Xuekai Wei, Jun Luo, Fan Jia, Xu Zhuang, Weijia Jia","doi":"10.1007/s11263-024-02338-7","DOIUrl":"https://doi.org/10.1007/s11263-024-02338-7","url":null,"abstract":"<p>In deep learning-based no-reference image quality assessment (NR-IQA) methods, the absence of reference images limits their ability to assess content fidelity, making it difficult to distinguish between original content and distortions that degrade quality. To address this issue, we propose a quality adversarial learning framework emphasizing both content fidelity and prediction accuracy. The main contributions of this study are as follows: First, we investigate the importance of content fidelity, especially in no-reference scenarios. Second, we propose a quality adversarial learning framework that dynamically adapts and refines the image quality assessment process on the basis of the quality optimization results. The framework generates adversarial samples for the quality prediction model, and simultaneously, the quality prediction model optimizes the quality prediction model by using these adversarial samples to maintain fidelity and improve accuracy. Finally, we demonstrate that by employing the quality prediction model as a loss function for image quality optimization, our framework effectively reduces the generation of artifacts, highlighting its superior ability to preserve content fidelity. The experimental results demonstrate the validity of our method compared with state-of-the-art NR-IQA methods. The code is publicly available at the following website: https://github.com/Land5cape/QAL-IQA.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"27 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142917147","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"RepSNet: A Nucleus Instance Segmentation Model Based on Boundary Regression and Structural Re-Parameterization","authors":"Shengchun Xiong, Xiangru Li, Yunpeng Zhong, Wanfen Peng","doi":"10.1007/s11263-024-02332-z","DOIUrl":"https://doi.org/10.1007/s11263-024-02332-z","url":null,"abstract":"<p>Pathological diagnosis is the gold standard for tumor diagnosis, and nucleus instance segmentation is a key step in digital pathology analysis and pathological diagnosis. However, the computational efficiency of the model and the treatment of overlapping targets are the major challenges in the studies of this problem. To this end, a neural network model RepSNet was designed based on a nucleus boundary regression and a structural re-parameterization scheme for segmenting and classifying the nuclei in H&E-stained histopathological images. First, RepSNet estimates the boundary position information (BPI) of the parent nucleus for each pixel. The BPI estimation incorporates the local information of the pixel and the contextual information of the parent nucleus. Then, the nucleus boundary is estimated by aggregating the BPIs from a series of pixels using a proposed boundary voting mechanism (BVM), and the instance segmentation results are computed from the estimated nucleus boundary using a connected component analysis procedure. The BVM intrinsically achieves a kind of synergistic belief enhancement among the BPIs from various pixels. Therefore, different from the methods available in literature that obtain nucleus boundaries based on a direct pixel recognition scheme, RepSNet computes its boundary decisions based on some guidances from macroscopic information using an integration mechanism. In addition, RepSNet employs a re-parametrizable encoder-decoder structure. This model can not only aggregate features from some receptive fields with various scales which helps segmentation accuracy improvement, but also reduce the parameter amount and computational burdens in the model inference phase through the structural re-parameterization technique. In the experimental comparisons and evaluations on the Lizard dataset, RepSNet demonstrated superior segmentation accuracy and inference speed compared to several typical benchmark models. The experimental code, dataset splitting configuration and the pre-trained model were released at https://github.com/luckyrz0/RepSNet.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"25 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142917313","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Pseudo-Plane Regularized Signed Distance Field for Neural Indoor Scene Reconstruction","authors":"Jing Li, Jinpeng Yu, Ruoyu Wang, Shenghua Gao","doi":"10.1007/s11263-024-02319-w","DOIUrl":"https://doi.org/10.1007/s11263-024-02319-w","url":null,"abstract":"<p>Given only a set of images, neural implicit surface representation has shown its capability in 3D surface reconstruction. However, as the nature of per-scene optimization is based on the volumetric rendering of color, previous neural implicit surface reconstruction methods usually fail in the low-textured regions, including floors, walls, etc., which commonly exist for indoor scenes. Being aware of the fact that these low-textured regions usually correspond to planes, without introducing additional ground-truth supervisory signals or making additional assumptions about the room layout, we propose to leverage a novel Pseudo-plane regularized Signed Distance Field (PPlaneSDF) for indoor scene reconstruction. Specifically, we consider adjacent pixels with similar colors to be on the same pseudo-planes. The plane parameters are then estimated on the fly during training by an efficient and effective two-step scheme. Then the signed distances of the points on the planes are regularized by the estimated plane parameters in the training phase. As the unsupervised plane segments are usually noisy and inaccurate, we propose to assign different weights to the sampled points on the plane in plane estimation as well as the regularization loss. The weights come by fusing the plane segments from different views. As the sampled rays in the planar regions are redundant, leading to inefficient training, we further propose a keypoint-guided rays sampling strategy that attends to the informative textured regions with large color variations, and the implicit network gets a better reconstruction, compared with the original uniform ray sampling strategy. Experiments show that our PPlaneSDF achieves competitive reconstruction performance in Manhattan scenes. Further, as we do not introduce any additional room layout assumption, our PPlaneSDF generalizes well to the reconstruction of non-Manhattan scenes.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"14 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-12-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142905137","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CSFRNet: Integrating Clothing Status Awareness for Long-Term Person Re-identification","authors":"Yan Huang, Yan Huang, Zhang Zhang, Qiang Wu, Yi Zhong, Liang Wang","doi":"10.1007/s11263-024-02315-0","DOIUrl":"https://doi.org/10.1007/s11263-024-02315-0","url":null,"abstract":"<p>Addressing the dynamic nature of long-term person re-identification (LT-reID) amid varying clothing conditions necessitates a departure from conventional methods. Traditional LT-reID strategies, mainly biometrics-based and data adaptation-based, each have their pitfalls. The former falters in environments lacking high-quality biometric data, while the latter loses efficacy with minimal or subtle clothing changes. To overcome these obstacles, we propose the clothing status-aware feature regularization network (CSFRNet). This novel approach seamlessly incorporates clothing status awareness into the feature learning process, significantly enhancing the adaptability and accuracy of LT-reID systems where clothing can either change completely, partially, or not at all over time, without the need for explicit clothing labels. The versatility of our CSFRNet is showcased on diverse LT-reID benchmarks, including Celeb-reID, Celeb-reID-light, PRCC, DeepChange, and LTCC, marking a significant advancement in the field by addressing the real-world variability of clothing in LT-reID scenarios.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"48 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142901709","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"AniClipart: Clipart Animation with Text-to-Video Priors","authors":"Ronghuan Wu, Wanchao Su, Kede Ma, Jing Liao","doi":"10.1007/s11263-024-02306-1","DOIUrl":"https://doi.org/10.1007/s11263-024-02306-1","url":null,"abstract":"<p>Clipart, a pre-made graphic art form, offers a convenient and efficient way of illustrating visual content. Traditional workflows to convert static clipart images into motion sequences are laborious and time-consuming, involving numerous intricate steps like rigging, key animation and in-betweening. Recent advancements in text-to-video generation hold great potential in resolving this problem. Nevertheless, direct application of text-to-video generation models often struggles to retain the visual identity of clipart images or generate cartoon-style motions, resulting in unsatisfactory animation outcomes. In this paper, we introduce AniClipart, a system that transforms static clipart images into high-quality motion sequences guided by text-to-video priors. To generate cartoon-style and smooth motion, we first define Bézier curves over keypoints of the clipart image as a form of motion regularization. We then align the motion trajectories of the keypoints with the provided text prompt by optimizing the Video Score Distillation Sampling (VSDS) loss, which encodes adequate knowledge of natural motion within a pretrained text-to-video diffusion model. With a differentiable As-Rigid-As-Possible shape deformation algorithm, our method can be end-to-end optimized while maintaining deformation rigidity. Experimental results show that the proposed AniClipart consistently outperforms existing image-to-video generation models, in terms of text-video alignment, visual identity preservation, and motion consistency. Furthermore, we showcase the versatility of AniClipart by adapting it to generate a broader array of animation formats, such as layered animation, which allows topological changes.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"20 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-12-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142888337","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Combating Label Noise with a General Surrogate Model for Sample Selection","authors":"Chao Liang, Linchao Zhu, Humphrey Shi, Yi Yang","doi":"10.1007/s11263-024-02324-z","DOIUrl":"https://doi.org/10.1007/s11263-024-02324-z","url":null,"abstract":"<p>Modern deep learning systems are data-hungry. Learning with web data is one of the feasible solutions, but will introduce label noise inevitably, which can hinder the performance of deep neural networks. Sample selection is an effective way to deal with label noise. The key is to separate clean samples based on some criterion. Previous methods pay more attention to the small loss criterion where small-loss samples are regarded as clean ones. Nevertheless, such a strategy relies on the learning dynamics of each data instance. Some noisy samples are still memorized due to frequently occurring corrupted learning patterns. To tackle this problem, a training-free surrogate model is preferred, freeing from the effect of memorization. In this work, we propose to leverage the vision-language surrogate model CLIP to filter noisy samples automatically. CLIP brings external knowledge to facilitate the selection of clean samples with its ability of text-image alignment. Furthermore, a margin adaptive loss is designed to regularize the selection bias introduced by CLIP, providing robustness to label noise. We validate the effectiveness of our proposed method on both real-world and synthetic noisy datasets. Our method achieves significant improvement without CLIP involved during the inference stage.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"23 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-12-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142888938","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}