{"title":"RIGID: Recurrent GAN Inversion and Editing of Real Face Videos and Beyond","authors":"Yangyang Xu, Shengfeng He, Kwan-Yee K. Wong, Ping Luo","doi":"10.1007/s11263-024-02329-8","DOIUrl":"https://doi.org/10.1007/s11263-024-02329-8","url":null,"abstract":"<p>GAN inversion is essential for harnessing the editability of GANs in real images, yet existing methods that invert video frames individually often yield temporally inconsistent results. To address this issue, we present a unified recurrent framework, <b>R</b>ecurrent v<b>I</b>deo <b>G</b>AN <b>I</b>nversion and e<b>D</b>iting (RIGID), designed to enforce temporally coherent GAN inversion and facial editing in real videos explicitly and simultaneously. Our approach models temporal relations between current and previous frames in three ways: (1) by maximizing inversion fidelity and consistency through learning a temporally compensated latent code and spatial features, (2) by disentangling high-frequency incoherent noises from the latent space, and (3) by introducing an in-between frame composition constraint to eliminate inconsistency after attribute manipulation, ensuring that each frame is a direct composite of its neighbors. Compared to existing video- and attribute-specific works, RIGID eliminates the need for expensive re-training of the model, resulting in approximately 60<span>(times )</span> faster performance. Furthermore, RIGID can be easily extended to other face domains, showcasing its versatility and adaptability. Extensive experiments demonstrate that RIGID outperforms state-of-the-art methods in inversion and editing tasks both qualitatively and quantitatively.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"121 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-01-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142967874","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Learning Meshing from Delaunay Triangulation for 3D Shape Representation","authors":"Chen Zhang, Wenbing Tao","doi":"10.1007/s11263-024-02344-9","DOIUrl":"https://doi.org/10.1007/s11263-024-02344-9","url":null,"abstract":"<p>Recently, there has been a growing interest in learning-based explicit methods due to their ability to respect the original input and preserve details. However, the connectivity on complex structures is still difficult to infer due to the limited local shape perception, resulting in artifacts and non-watertight triangles. In this paper, we present a novel learning-based method with Delaunay triangulation to achieve high-precision reconstruction. We model the Delaunay triangulation as a dual graph, extract multi-scale geometric information from the points, and embed it into the structural representation of Delaunay triangulation in an organic way, benefiting fine-grained details reconstruction. To encourage neighborhood information interaction of edges and nodes in the graph, we introduce a Local Graph Iteration algorithm, serving as a variant of graph neural network. Benefiting from its robust local processing for dual graph, a scaling strategy is designed to enable large-scale reconstruction. Moreover, due to the complicated spatial relations between tetrahedrons and the ground truth surface, it is hard to directly generate ground truth labels of tetrahedrons for supervision. Therefore, we propose a multi-label supervision strategy, which is integrated in the loss we design for this task and allows our method to obtain robust labeling without visibility information. Experiments show that our method yields watertight and high-quality meshes. Especially for some thin structures and sharp edges, our method shows better performance than the current state-of-the-art methods. Furthermore, it has a strong adaptability to point clouds of different densities.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"6 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142939983","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"LDTrack: Dynamic People Tracking by Service Robots Using Diffusion Models","authors":"Angus Fung, Beno Benhabib, Goldie Nejat","doi":"10.1007/s11263-024-02336-9","DOIUrl":"https://doi.org/10.1007/s11263-024-02336-9","url":null,"abstract":"<p>Tracking of dynamic people in cluttered and crowded human-centered environments is a challenging robotics problem due to the presence of intraclass variations including occlusions, pose deformations, and lighting variations. This paper introduces a novel deep learning architecture, using conditional latent diffusion models, the Latent Diffusion Track (<i>LDTrack</i>), for tracking multiple dynamic people under intraclass variations. By uniquely utilizing conditional latent diffusion models to capture temporal person embeddings, our architecture can adapt to appearance changes of people over time. We incorporated a latent feature encoder network which enables the diffusion process to operate within a high-dimensional latent space to allow for the extraction and spatial–temporal refinement of such rich features as person appearance, motion, location, identity, and contextual information. Extensive experiments demonstrate the effectiveness of <i>LDTrack</i> over other state-of-the-art tracking methods in cluttered and crowded human-centered environments under intraclass variations. Namely, the results show our method outperforms existing deep learning robotic people tracking methods in both tracking accuracy and tracking precision with statistical significance. Additionally, a comprehensive multi-object tracking comparison study was performed against the state-of-the-art methods in urban environments, demonstrating the generalizability of <i>LDTrack</i>. An ablation study was performed to validate the design choices of <i>LDTrack</i>.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"5 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-01-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142937023","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Context-Aware Multi-view Stereo Network for Efficient Edge-Preserving Depth Estimation","authors":"Wanjuan Su, Wenbing Tao","doi":"10.1007/s11263-024-02337-8","DOIUrl":"https://doi.org/10.1007/s11263-024-02337-8","url":null,"abstract":"<p>Learning-based multi-view stereo methods have achieved great progress in recent years by employing the coarse-to-fine depth estimation framework. However, existing methods still encounter difficulties in recovering depth in featureless areas, object boundaries, and thin structures which mainly due to the poor distinguishability of matching clues in low-textured regions, the inherently smooth properties of 3D convolution neural networks used for cost volume regularization, and information loss of the coarsest scale features. To address these issues, we propose a Context-Aware multi-view stereo Network (CANet) that leverages contextual cues in images to achieve efficient edge-preserving depth estimation. The structural self-similarity information in the reference view is exploited by the introduced self-similarity attended cost aggregation module to perform long-range dependencies modeling in the cost volume, which can boost the matchability of featureless regions. The context information in the reference view is subsequently utilized to progressively refine multi-scale depth estimation through the proposed hierarchical edge-preserving residual learning module, resulting in delicate depth estimation at edges. To enrich features at the coarsest scale by making it focus more on delicate areas, a focal selection module is presented which can enhance the recovery of initial depth with finer details such as thin structure. By integrating the strategies above into the well-designed lightweight cascade framework, CANet achieves superior performance and efficiency trade-offs. Extensive experiments show that the proposed method achieves state-of-the-art performance with fast inference speed and low memory usage. Notably, CANet ranks first on challenging Tanks and Temples advanced dataset and ETH3D high-res benchmark among all published learning-based methods.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"39 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142935481","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Delving Deep into Simplicity Bias for Long-Tailed Image Recognition","authors":"Xiu-Shen Wei, Xuhao Sun, Yang Shen, Peng Wang","doi":"10.1007/s11263-024-02342-x","DOIUrl":"https://doi.org/10.1007/s11263-024-02342-x","url":null,"abstract":"<p>Simplicity Bias (SB) is a phenomenon that deep neural networks tend to rely favorably on simpler predictive patterns but ignore some complex features when applied to supervised discriminative tasks. In this work, we investigate SB in long-tailed image recognition and find the tail classes suffer more severely from SB, which harms the generalization performance of such underrepresented classes. We empirically report that self-supervised learning (SSL) can mitigate SB and perform in complementary to the supervised counterpart by enriching the features extracted from tail samples and consequently taking better advantage of such rare samples. However, standard SSL methods are designed without explicitly considering the inherent data distribution in terms of classes and may not be optimal for long-tailed distributed data. To address this limitation, we propose a novel SSL method tailored to imbalanced data. It leverages SSL by triple diverse levels, <i>i.e.</i>, holistic-, partial-, and augmented-level, to enhance the learning of predictive complex patterns, which provides the potential to overcome the severe SB on tail data. Both quantitative and qualitative experimental results on five long-tailed benchmark datasets show our method can effectively mitigate SB and significantly outperform the competing state-of-the-arts.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"5 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-01-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142929448","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Relation-Guided Versatile Regularization for Federated Semi-Supervised Learning","authors":"Qiushi Yang, Zhen Chen, Zhe Peng, Yixuan Yuan","doi":"10.1007/s11263-024-02330-1","DOIUrl":"https://doi.org/10.1007/s11263-024-02330-1","url":null,"abstract":"<p>Federated semi-supervised learning (FSSL) target to address the increasing privacy concerns for the practical scenarios, where data holders are limited in labeling capability. Latest FSSL approaches leverage the prediction consistency between the local model and global model to exploit knowledge from partially labeled or completely unlabeled clients. However, they merely utilize data-level augmentation for prediction consistency and simply aggregate model parameters through the weighted average at the server, which leads to biased classifiers and suffers from skewed unlabeled clients. To remedy these issues, we present a novel FSSL framework, Relation-guided Versatile Regularization (FedRVR), consisting of versatile regularization at clients and relation-guided directional aggregation strategy at the server. In versatile regularization, we propose the model-guided regularization together with the data-guided one, and encourage the prediction of the local model invariant to two extreme global models with different abilities, which provides richer consistency supervision for local training. Moreover, we devise a relation-guided directional aggregation at the server, in which a parametric relation predictor is introduced to yield pairwise model relation and obtain a model ranking. In this manner, the server can provide a superior global model by aggregating relative dependable client models, and further produce an inferior global model via reverse aggregation to promote the versatile regularization at clients. Extensive experiments on three FSSL benchmarks verify the superiority of FedRVR over state-of-the-art counterparts across various federated learning settings.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"34 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-01-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142925117","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PICK: Predict and Mask for Semi-supervised Medical Image Segmentation","authors":"Qingjie Zeng, Zilin Lu, Yutong Xie, Yong Xia","doi":"10.1007/s11263-024-02328-9","DOIUrl":"https://doi.org/10.1007/s11263-024-02328-9","url":null,"abstract":"<p>Pseudo-labeling and consistency-based co-training are established paradigms in semi-supervised learning. Pseudo-labeling focuses on selecting reliable pseudo-labels, while co-training emphasizes sub-network diversity for complementary information extraction. However, both paradigms struggle with the inevitable erroneous predictions from unlabeled data, which poses a risk to task-specific decoders and ultimately impact model performance. To address this challenge, we propose a PredICt-and-masK (PICK) model for semi-supervised medical image segmentation. PICK operates by masking and predicting pseudo-label-guided attentive regions to exploit unlabeled data. It features a shared encoder and three task-specific decoders. Specifically, PICK employs a primary decoder supervised solely by labeled data to generate pseudo-labels, identifying potential targets in unlabeled data. The model then masks these regions and reconstructs them using a masked image modeling (MIM) decoder, optimizing through a reconstruction task. To reconcile segmentation and reconstruction, an auxiliary decoder is further developed to learn from the reconstructed images, whose predictions are constrained by the primary decoder. We evaluate PICK on five medical benchmarks, including single organ/tumor segmentation, multi-organ segmentation, and domain-generalized tasks. Our results indicate that PICK outperforms state-of-the-art methods. The code is available at https://github.com/maxwell0027/PICK.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"27 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-01-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142929487","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"General Class-Balanced Multicentric Dynamic Prototype Pseudo-Labeling for Source-Free Domain Adaptation","authors":"Sanqing Qu, Guang Chen, Jing Zhang, Zhijun Li, Wei He, Dacheng Tao","doi":"10.1007/s11263-024-02335-w","DOIUrl":"https://doi.org/10.1007/s11263-024-02335-w","url":null,"abstract":"<p>Source-free Domain Adaptation aims to adapt a pre-trained source model to an unlabeled target domain while circumventing access to well-labeled source data. To compensate for the absence of source data, most existing approaches employ prototype-based pseudo-labeling strategies to facilitate self-training model adaptation. Nevertheless, these methods commonly rely on instance-level predictions for direct monocentric prototype construction, leading to category bias and noisy labels. This is primarily due to the inherent visual domain gaps that often differ across categories. Besides, the monocentric prototype design is ineffective and may introduce negative transfer for those ambiguous data. To tackle these challenges, we propose a general class-<b>B</b>alanced <b>M</b>ulticentric <b>D</b>ynamic (BMD) prototype strategy. Specifically, we first introduce a global inter-class balanced sampling strategy for each target category to mitigate category bias. Subsequently, we design an intra-class multicentric clustering strategy to generate robust and representative prototypes. In contrast to existing approaches that only update pseudo-labels at fixed intervals, e.g., one epoch, we employ a dynamic pseudo-labeling strategy that incorporates network update information throughout the model adaptation. We refer to the vanilla implementation of these three sub-strategies as BMD-v1. Furthermore, we promote the BMD-v1 to BMD-v2 by incorporating a consistency-guided reweighting strategy to improve inter-class balanced sampling, and leveraging the silhouettes metric to realize adaptive intra-class multicentric clustering. Extensive experiments conducted on both 2D images and 3D point cloud recognition demonstrate that our proposed BMD strategy significantly improves existing representative methods. Remarkably, BMD-v2 improves NRC from 52.6 to 59.2% in accuracy on the PointDA-10 benchmark. The code will be available at https://github.com/ispc-lab/BMD.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"159 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-01-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142925097","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HUPE: Heuristic Underwater Perceptual Enhancement with Semantic Collaborative Learning","authors":"Zengxi Zhang, Zhiying Jiang, Long Ma, Jinyuan Liu, Xin Fan, Risheng Liu","doi":"10.1007/s11263-024-02318-x","DOIUrl":"https://doi.org/10.1007/s11263-024-02318-x","url":null,"abstract":"<p>Underwater images are often affected by light refraction and absorption, reducing visibility and interfering with subsequent applications. Existing underwater image enhancement methods primarily focus on improving visual quality while overlooking practical implications. To strike a balance between visual quality and application, we propose a heuristic invertible network for underwater perception enhancement, dubbed HUPE, which enhances visual quality and demonstrates flexibility in handling other downstream tasks. Specifically, we introduced a information-preserving reversible transformation with embedded Fourier transform to establish a bidirectional mapping between underwater images and their clear images. Additionally, a heuristic prior is incorporated into the enhancement process to better capture scene information. To further bridges the feature gap between vision-based enhancement images and application-oriented images, a semantic collaborative learning module is applied in the joint optimization process of the visual enhancement task and the downstream task, which guides the proposed enhancement model to extract more task-oriented semantic features while obtaining visually pleasing images. Extensive experiments, both quantitative and qualitative, demonstrate the superiority of our HUPE over state-of-the-art methods. The source code is available at https://github.com/ZengxiZhang/HUPE.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"20 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-01-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142925119","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Robust Sequential DeepFake Detection","authors":"Rui Shao, Tianxing Wu, Ziwei Liu","doi":"10.1007/s11263-024-02339-6","DOIUrl":"https://doi.org/10.1007/s11263-024-02339-6","url":null,"abstract":"<p>Since photorealistic faces can be readily generated by facial manipulation technologies nowadays, potential malicious abuse of these technologies has drawn great concerns. Numerous deepfake detection methods are thus proposed. However, existing methods only focus on detecting <i>one-step</i> facial manipulation. As the emergence of easy-accessible facial editing applications, people can easily manipulate facial components using <i>multi-step</i> operations in a sequential manner. This new threat requires us to detect a sequence of facial manipulations, which is vital for both detecting deepfake media and recovering original faces afterwards. Motivated by this observation, we emphasize the need and propose a novel research problem called Detecting Sequential DeepFake Manipulation (Seq-DeepFake). Unlike the existing deepfake detection task only demanding a binary label prediction, detecting Seq-DeepFake manipulation requires correctly predicting a sequential vector of facial manipulation operations. To support a large-scale investigation, we construct the first Seq-DeepFake dataset, where face images are manipulated sequentially with corresponding annotations of sequential facial manipulation vectors. Based on this new dataset, we cast detecting Seq-DeepFake manipulation as a specific image-to-sequence (e.g., image captioning) task and propose a concise yet effective Seq-DeepFake Transformer (SeqFakeFormer). To better reflect real-world deepfake data distributions, we further apply various perturbations on the original Seq-DeepFake dataset and construct the more challenging Sequential DeepFake dataset with perturbations (Seq-DeepFake-P). To exploit deeper correlation between images and sequences when facing Seq-DeepFake-P, a dedicated Seq-DeepFake Transformer with Image-Sequence Reasoning (SeqFakeFormer++) is devised, which builds stronger correspondence between image-sequence pairs for more robust Seq-DeepFake detection. Moreover, we build a comprehensive benchmark and set up rigorous evaluation protocols and metrics for this new research problem. Extensive quantitative and qualitative experiments demonstrate the effectiveness of SeqFakeFormer and SeqFakeFormer++. Several valuable observations are also revealed to facilitate future research in broader deepfake detection problems. The code has been released at https://github.com/rshaojimmy/SeqDeepFake/.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"388 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-01-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142924999","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}