{"title":"Context-Aware Multi-view Stereo Network for Efficient Edge-Preserving Depth Estimation","authors":"Wanjuan Su, Wenbing Tao","doi":"10.1007/s11263-024-02337-8","DOIUrl":"https://doi.org/10.1007/s11263-024-02337-8","url":null,"abstract":"<p>Learning-based multi-view stereo methods have achieved great progress in recent years by employing the coarse-to-fine depth estimation framework. However, existing methods still encounter difficulties in recovering depth in featureless areas, object boundaries, and thin structures which mainly due to the poor distinguishability of matching clues in low-textured regions, the inherently smooth properties of 3D convolution neural networks used for cost volume regularization, and information loss of the coarsest scale features. To address these issues, we propose a Context-Aware multi-view stereo Network (CANet) that leverages contextual cues in images to achieve efficient edge-preserving depth estimation. The structural self-similarity information in the reference view is exploited by the introduced self-similarity attended cost aggregation module to perform long-range dependencies modeling in the cost volume, which can boost the matchability of featureless regions. The context information in the reference view is subsequently utilized to progressively refine multi-scale depth estimation through the proposed hierarchical edge-preserving residual learning module, resulting in delicate depth estimation at edges. To enrich features at the coarsest scale by making it focus more on delicate areas, a focal selection module is presented which can enhance the recovery of initial depth with finer details such as thin structure. By integrating the strategies above into the well-designed lightweight cascade framework, CANet achieves superior performance and efficiency trade-offs. Extensive experiments show that the proposed method achieves state-of-the-art performance with fast inference speed and low memory usage. Notably, CANet ranks first on challenging Tanks and Temples advanced dataset and ETH3D high-res benchmark among all published learning-based methods.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"39 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142935481","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Delving Deep into Simplicity Bias for Long-Tailed Image Recognition","authors":"Xiu-Shen Wei, Xuhao Sun, Yang Shen, Peng Wang","doi":"10.1007/s11263-024-02342-x","DOIUrl":"https://doi.org/10.1007/s11263-024-02342-x","url":null,"abstract":"<p>Simplicity Bias (SB) is a phenomenon that deep neural networks tend to rely favorably on simpler predictive patterns but ignore some complex features when applied to supervised discriminative tasks. In this work, we investigate SB in long-tailed image recognition and find the tail classes suffer more severely from SB, which harms the generalization performance of such underrepresented classes. We empirically report that self-supervised learning (SSL) can mitigate SB and perform in complementary to the supervised counterpart by enriching the features extracted from tail samples and consequently taking better advantage of such rare samples. However, standard SSL methods are designed without explicitly considering the inherent data distribution in terms of classes and may not be optimal for long-tailed distributed data. To address this limitation, we propose a novel SSL method tailored to imbalanced data. It leverages SSL by triple diverse levels, <i>i.e.</i>, holistic-, partial-, and augmented-level, to enhance the learning of predictive complex patterns, which provides the potential to overcome the severe SB on tail data. Both quantitative and qualitative experimental results on five long-tailed benchmark datasets show our method can effectively mitigate SB and significantly outperform the competing state-of-the-arts.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"5 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-01-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142929448","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Relation-Guided Versatile Regularization for Federated Semi-Supervised Learning","authors":"Qiushi Yang, Zhen Chen, Zhe Peng, Yixuan Yuan","doi":"10.1007/s11263-024-02330-1","DOIUrl":"https://doi.org/10.1007/s11263-024-02330-1","url":null,"abstract":"<p>Federated semi-supervised learning (FSSL) target to address the increasing privacy concerns for the practical scenarios, where data holders are limited in labeling capability. Latest FSSL approaches leverage the prediction consistency between the local model and global model to exploit knowledge from partially labeled or completely unlabeled clients. However, they merely utilize data-level augmentation for prediction consistency and simply aggregate model parameters through the weighted average at the server, which leads to biased classifiers and suffers from skewed unlabeled clients. To remedy these issues, we present a novel FSSL framework, Relation-guided Versatile Regularization (FedRVR), consisting of versatile regularization at clients and relation-guided directional aggregation strategy at the server. In versatile regularization, we propose the model-guided regularization together with the data-guided one, and encourage the prediction of the local model invariant to two extreme global models with different abilities, which provides richer consistency supervision for local training. Moreover, we devise a relation-guided directional aggregation at the server, in which a parametric relation predictor is introduced to yield pairwise model relation and obtain a model ranking. In this manner, the server can provide a superior global model by aggregating relative dependable client models, and further produce an inferior global model via reverse aggregation to promote the versatile regularization at clients. Extensive experiments on three FSSL benchmarks verify the superiority of FedRVR over state-of-the-art counterparts across various federated learning settings.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"34 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-01-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142925117","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PICK: Predict and Mask for Semi-supervised Medical Image Segmentation","authors":"Qingjie Zeng, Zilin Lu, Yutong Xie, Yong Xia","doi":"10.1007/s11263-024-02328-9","DOIUrl":"https://doi.org/10.1007/s11263-024-02328-9","url":null,"abstract":"<p>Pseudo-labeling and consistency-based co-training are established paradigms in semi-supervised learning. Pseudo-labeling focuses on selecting reliable pseudo-labels, while co-training emphasizes sub-network diversity for complementary information extraction. However, both paradigms struggle with the inevitable erroneous predictions from unlabeled data, which poses a risk to task-specific decoders and ultimately impact model performance. To address this challenge, we propose a PredICt-and-masK (PICK) model for semi-supervised medical image segmentation. PICK operates by masking and predicting pseudo-label-guided attentive regions to exploit unlabeled data. It features a shared encoder and three task-specific decoders. Specifically, PICK employs a primary decoder supervised solely by labeled data to generate pseudo-labels, identifying potential targets in unlabeled data. The model then masks these regions and reconstructs them using a masked image modeling (MIM) decoder, optimizing through a reconstruction task. To reconcile segmentation and reconstruction, an auxiliary decoder is further developed to learn from the reconstructed images, whose predictions are constrained by the primary decoder. We evaluate PICK on five medical benchmarks, including single organ/tumor segmentation, multi-organ segmentation, and domain-generalized tasks. Our results indicate that PICK outperforms state-of-the-art methods. The code is available at https://github.com/maxwell0027/PICK.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"27 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-01-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142929487","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"General Class-Balanced Multicentric Dynamic Prototype Pseudo-Labeling for Source-Free Domain Adaptation","authors":"Sanqing Qu, Guang Chen, Jing Zhang, Zhijun Li, Wei He, Dacheng Tao","doi":"10.1007/s11263-024-02335-w","DOIUrl":"https://doi.org/10.1007/s11263-024-02335-w","url":null,"abstract":"<p>Source-free Domain Adaptation aims to adapt a pre-trained source model to an unlabeled target domain while circumventing access to well-labeled source data. To compensate for the absence of source data, most existing approaches employ prototype-based pseudo-labeling strategies to facilitate self-training model adaptation. Nevertheless, these methods commonly rely on instance-level predictions for direct monocentric prototype construction, leading to category bias and noisy labels. This is primarily due to the inherent visual domain gaps that often differ across categories. Besides, the monocentric prototype design is ineffective and may introduce negative transfer for those ambiguous data. To tackle these challenges, we propose a general class-<b>B</b>alanced <b>M</b>ulticentric <b>D</b>ynamic (BMD) prototype strategy. Specifically, we first introduce a global inter-class balanced sampling strategy for each target category to mitigate category bias. Subsequently, we design an intra-class multicentric clustering strategy to generate robust and representative prototypes. In contrast to existing approaches that only update pseudo-labels at fixed intervals, e.g., one epoch, we employ a dynamic pseudo-labeling strategy that incorporates network update information throughout the model adaptation. We refer to the vanilla implementation of these three sub-strategies as BMD-v1. Furthermore, we promote the BMD-v1 to BMD-v2 by incorporating a consistency-guided reweighting strategy to improve inter-class balanced sampling, and leveraging the silhouettes metric to realize adaptive intra-class multicentric clustering. Extensive experiments conducted on both 2D images and 3D point cloud recognition demonstrate that our proposed BMD strategy significantly improves existing representative methods. Remarkably, BMD-v2 improves NRC from 52.6 to 59.2% in accuracy on the PointDA-10 benchmark. The code will be available at https://github.com/ispc-lab/BMD.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"159 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-01-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142925097","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HUPE: Heuristic Underwater Perceptual Enhancement with Semantic Collaborative Learning","authors":"Zengxi Zhang, Zhiying Jiang, Long Ma, Jinyuan Liu, Xin Fan, Risheng Liu","doi":"10.1007/s11263-024-02318-x","DOIUrl":"https://doi.org/10.1007/s11263-024-02318-x","url":null,"abstract":"<p>Underwater images are often affected by light refraction and absorption, reducing visibility and interfering with subsequent applications. Existing underwater image enhancement methods primarily focus on improving visual quality while overlooking practical implications. To strike a balance between visual quality and application, we propose a heuristic invertible network for underwater perception enhancement, dubbed HUPE, which enhances visual quality and demonstrates flexibility in handling other downstream tasks. Specifically, we introduced a information-preserving reversible transformation with embedded Fourier transform to establish a bidirectional mapping between underwater images and their clear images. Additionally, a heuristic prior is incorporated into the enhancement process to better capture scene information. To further bridges the feature gap between vision-based enhancement images and application-oriented images, a semantic collaborative learning module is applied in the joint optimization process of the visual enhancement task and the downstream task, which guides the proposed enhancement model to extract more task-oriented semantic features while obtaining visually pleasing images. Extensive experiments, both quantitative and qualitative, demonstrate the superiority of our HUPE over state-of-the-art methods. The source code is available at https://github.com/ZengxiZhang/HUPE.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"20 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-01-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142925119","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Robust Sequential DeepFake Detection","authors":"Rui Shao, Tianxing Wu, Ziwei Liu","doi":"10.1007/s11263-024-02339-6","DOIUrl":"https://doi.org/10.1007/s11263-024-02339-6","url":null,"abstract":"<p>Since photorealistic faces can be readily generated by facial manipulation technologies nowadays, potential malicious abuse of these technologies has drawn great concerns. Numerous deepfake detection methods are thus proposed. However, existing methods only focus on detecting <i>one-step</i> facial manipulation. As the emergence of easy-accessible facial editing applications, people can easily manipulate facial components using <i>multi-step</i> operations in a sequential manner. This new threat requires us to detect a sequence of facial manipulations, which is vital for both detecting deepfake media and recovering original faces afterwards. Motivated by this observation, we emphasize the need and propose a novel research problem called Detecting Sequential DeepFake Manipulation (Seq-DeepFake). Unlike the existing deepfake detection task only demanding a binary label prediction, detecting Seq-DeepFake manipulation requires correctly predicting a sequential vector of facial manipulation operations. To support a large-scale investigation, we construct the first Seq-DeepFake dataset, where face images are manipulated sequentially with corresponding annotations of sequential facial manipulation vectors. Based on this new dataset, we cast detecting Seq-DeepFake manipulation as a specific image-to-sequence (e.g., image captioning) task and propose a concise yet effective Seq-DeepFake Transformer (SeqFakeFormer). To better reflect real-world deepfake data distributions, we further apply various perturbations on the original Seq-DeepFake dataset and construct the more challenging Sequential DeepFake dataset with perturbations (Seq-DeepFake-P). To exploit deeper correlation between images and sequences when facing Seq-DeepFake-P, a dedicated Seq-DeepFake Transformer with Image-Sequence Reasoning (SeqFakeFormer++) is devised, which builds stronger correspondence between image-sequence pairs for more robust Seq-DeepFake detection. Moreover, we build a comprehensive benchmark and set up rigorous evaluation protocols and metrics for this new research problem. Extensive quantitative and qualitative experiments demonstrate the effectiveness of SeqFakeFormer and SeqFakeFormer++. Several valuable observations are also revealed to facilitate future research in broader deepfake detection problems. The code has been released at https://github.com/rshaojimmy/SeqDeepFake/.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"388 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-01-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142924999","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Blind Image Quality Assessment: Exploring Content Fidelity Perceptibility via Quality Adversarial Learning","authors":"Mingliang Zhou, Wenhao Shen, Xuekai Wei, Jun Luo, Fan Jia, Xu Zhuang, Weijia Jia","doi":"10.1007/s11263-024-02338-7","DOIUrl":"https://doi.org/10.1007/s11263-024-02338-7","url":null,"abstract":"<p>In deep learning-based no-reference image quality assessment (NR-IQA) methods, the absence of reference images limits their ability to assess content fidelity, making it difficult to distinguish between original content and distortions that degrade quality. To address this issue, we propose a quality adversarial learning framework emphasizing both content fidelity and prediction accuracy. The main contributions of this study are as follows: First, we investigate the importance of content fidelity, especially in no-reference scenarios. Second, we propose a quality adversarial learning framework that dynamically adapts and refines the image quality assessment process on the basis of the quality optimization results. The framework generates adversarial samples for the quality prediction model, and simultaneously, the quality prediction model optimizes the quality prediction model by using these adversarial samples to maintain fidelity and improve accuracy. Finally, we demonstrate that by employing the quality prediction model as a loss function for image quality optimization, our framework effectively reduces the generation of artifacts, highlighting its superior ability to preserve content fidelity. The experimental results demonstrate the validity of our method compared with state-of-the-art NR-IQA methods. The code is publicly available at the following website: https://github.com/Land5cape/QAL-IQA.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"27 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142917147","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"RepSNet: A Nucleus Instance Segmentation Model Based on Boundary Regression and Structural Re-Parameterization","authors":"Shengchun Xiong, Xiangru Li, Yunpeng Zhong, Wanfen Peng","doi":"10.1007/s11263-024-02332-z","DOIUrl":"https://doi.org/10.1007/s11263-024-02332-z","url":null,"abstract":"<p>Pathological diagnosis is the gold standard for tumor diagnosis, and nucleus instance segmentation is a key step in digital pathology analysis and pathological diagnosis. However, the computational efficiency of the model and the treatment of overlapping targets are the major challenges in the studies of this problem. To this end, a neural network model RepSNet was designed based on a nucleus boundary regression and a structural re-parameterization scheme for segmenting and classifying the nuclei in H&E-stained histopathological images. First, RepSNet estimates the boundary position information (BPI) of the parent nucleus for each pixel. The BPI estimation incorporates the local information of the pixel and the contextual information of the parent nucleus. Then, the nucleus boundary is estimated by aggregating the BPIs from a series of pixels using a proposed boundary voting mechanism (BVM), and the instance segmentation results are computed from the estimated nucleus boundary using a connected component analysis procedure. The BVM intrinsically achieves a kind of synergistic belief enhancement among the BPIs from various pixels. Therefore, different from the methods available in literature that obtain nucleus boundaries based on a direct pixel recognition scheme, RepSNet computes its boundary decisions based on some guidances from macroscopic information using an integration mechanism. In addition, RepSNet employs a re-parametrizable encoder-decoder structure. This model can not only aggregate features from some receptive fields with various scales which helps segmentation accuracy improvement, but also reduce the parameter amount and computational burdens in the model inference phase through the structural re-parameterization technique. In the experimental comparisons and evaluations on the Lizard dataset, RepSNet demonstrated superior segmentation accuracy and inference speed compared to several typical benchmark models. The experimental code, dataset splitting configuration and the pre-trained model were released at https://github.com/luckyrz0/RepSNet.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"25 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142917313","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Pseudo-Plane Regularized Signed Distance Field for Neural Indoor Scene Reconstruction","authors":"Jing Li, Jinpeng Yu, Ruoyu Wang, Shenghua Gao","doi":"10.1007/s11263-024-02319-w","DOIUrl":"https://doi.org/10.1007/s11263-024-02319-w","url":null,"abstract":"<p>Given only a set of images, neural implicit surface representation has shown its capability in 3D surface reconstruction. However, as the nature of per-scene optimization is based on the volumetric rendering of color, previous neural implicit surface reconstruction methods usually fail in the low-textured regions, including floors, walls, etc., which commonly exist for indoor scenes. Being aware of the fact that these low-textured regions usually correspond to planes, without introducing additional ground-truth supervisory signals or making additional assumptions about the room layout, we propose to leverage a novel Pseudo-plane regularized Signed Distance Field (PPlaneSDF) for indoor scene reconstruction. Specifically, we consider adjacent pixels with similar colors to be on the same pseudo-planes. The plane parameters are then estimated on the fly during training by an efficient and effective two-step scheme. Then the signed distances of the points on the planes are regularized by the estimated plane parameters in the training phase. As the unsupervised plane segments are usually noisy and inaccurate, we propose to assign different weights to the sampled points on the plane in plane estimation as well as the regularization loss. The weights come by fusing the plane segments from different views. As the sampled rays in the planar regions are redundant, leading to inefficient training, we further propose a keypoint-guided rays sampling strategy that attends to the informative textured regions with large color variations, and the implicit network gets a better reconstruction, compared with the original uniform ray sampling strategy. Experiments show that our PPlaneSDF achieves competitive reconstruction performance in Manhattan scenes. Further, as we do not introduce any additional room layout assumption, our PPlaneSDF generalizes well to the reconstruction of non-Manhattan scenes.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"14 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-12-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142905137","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}