International Journal of Computer Vision: Latest Articles

Fusion4DAL: Offline Multi-modal 3D Object Detection for 4D Auto-labeling
IF 19.5, Zone 2, Computer Science
International Journal of Computer Vision, Pub Date: 2025-02-15, DOI: 10.1007/s11263-025-02370-1
Zhiyuan Yang, Xuekuan Wang, Wei Zhang, Xiao Tan, Jincheng Lu, Jingdong Wang, Errui Ding, Cairong Zhao
{"title":"Fusion4DAL: Offline Multi-modal 3D Object Detection for 4D Auto-labeling","authors":"Zhiyuan Yang, Xuekuan Wang, Wei Zhang, Xiao Tan, Jincheng Lu, Jingdong Wang, Errui Ding, Cairong Zhao","doi":"10.1007/s11263-025-02370-1","DOIUrl":"https://doi.org/10.1007/s11263-025-02370-1","url":null,"abstract":"<p>Integrating LiDAR and camera information has been a widely adopted approach for 3D object detection in autonomous driving. Nevertheless, the unexplored potential of multi-modal fusion remains in the realm of offline 4D detection. We experimentally find that the root lies in two reasons: (1) the sparsity of point clouds poses a challenge in extracting long-term image features and thereby results in information loss. (2) some of the LiDAR points may be obstructed in the image, leading to incorrect image features. To tackle these problems, we first propose a simple yet effective offline multi-modal 3D object detection method, named Fusion4DAL, for 4D auto-labeling with long-term multi-modal sequences. Specifically, in order to address the sparsity of points within objects, we propose a multi-modal mixed feature fusion module (MMFF). In the MMFF module, we introduce virtual points based on a dense 3D grid and combine them with real LiDAR points. The mixed points are then utilized to extract dense point-level image features, thereby enhancing multi-modal feature fusion without being constrained by the sparse real LiDAR points. As to the obstructed LiDAR points, we leverage the occlusion relationship among objects to ensure depth consistency between LiDAR points and their corresponding depth feature maps, thus filtering out erroneous image features. In addition, we define a virtual point loss (VP Loss) to distinguish different types of mixed points and preserve the geometric shape of objects. Furthermore, in order to promote long-term receptive field and capture finer-grained features, we propose a global point attention decoder with a box-level self-attention module and a global point attention module. Finally, comprehensive experiments show that Fusion4DAL outperforms state-of-the-art offline 3D detection methods on nuScenes and Waymo dataset.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"20 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143417813","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
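To make the virtual-point idea concrete, here is a minimal, hypothetical PyTorch sketch of mixing a dense 3D grid of virtual points with real LiDAR points and sampling per-point image features. The function names, shapes, camera model, and grid resolution are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def make_virtual_points(box_min, box_max, steps=8):
    """Regular 3D grid of virtual points inside an axis-aligned box."""
    axes = [torch.linspace(box_min[i], box_max[i], steps) for i in range(3)]
    grid = torch.stack(torch.meshgrid(*axes, indexing="ij"), dim=-1)
    return grid.reshape(-1, 3)                          # (steps**3, 3)

def sample_point_image_features(points_xyz, K, feat_map):
    """Project 3D points with intrinsics K and bilinearly sample image features."""
    uvw = points_xyz @ K.T                              # pinhole projection
    uv = uvw[:, :2] / uvw[:, 2:3].clamp(min=1e-6)       # (N, 2) pixel coordinates
    H, W = feat_map.shape[-2:]
    grid = torch.stack([uv[:, 0] / (W - 1) * 2 - 1,     # normalize to [-1, 1]
                        uv[:, 1] / (H - 1) * 2 - 1], dim=-1).view(1, 1, -1, 2)
    feats = F.grid_sample(feat_map[None], grid, align_corners=True)
    return feats[0, :, 0].T                             # (N, C) per-point features

# Toy usage: mix sparse "real" points with dense virtual grid points, then sample.
real_pts = torch.rand(50, 3) * torch.tensor([1.0, 1.0, 4.0]) + torch.tensor([-0.5, -0.5, 5.0])
virt_pts = make_virtual_points([-0.5, -0.5, 5.0], [0.5, 0.5, 9.0])
mixed_pts = torch.cat([real_pts, virt_pts], dim=0)
K = torch.tensor([[500.0, 0.0, 64.0], [0.0, 500.0, 64.0], [0.0, 0.0, 1.0]])
point_image_feats = sample_point_image_features(mixed_pts, K, torch.rand(16, 128, 128))
```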
An Experimental Study on Exploring Strong Lightweight Vision Transformers via Masked Image Modeling Pre-training
IF 19.5, Zone 2, Computer Science
International Journal of Computer Vision, Pub Date: 2025-02-13, DOI: 10.1007/s11263-024-02327-w
Jin Gao, Shubo Lin, Shaoru Wang, Yutong Kou, Zeming Li, Liang Li, Congxuan Zhang, Xiaoqin Zhang, Yizheng Wang, Weiming Hu
{"title":"An Experimental Study on Exploring Strong Lightweight Vision Transformers via Masked Image Modeling Pre-training","authors":"Jin Gao, Shubo Lin, Shaoru Wang, Yutong Kou, Zeming Li, Liang Li, Congxuan Zhang, Xiaoqin Zhang, Yizheng Wang, Weiming Hu","doi":"10.1007/s11263-024-02327-w","DOIUrl":"https://doi.org/10.1007/s11263-024-02327-w","url":null,"abstract":"<p>Masked image modeling (MIM) pre-training for large-scale vision transformers (ViTs) has enabled promising downstream performance on top of the learned self-supervised ViT features. In this paper, we question if the <i>extremely simple</i> lightweight ViTs’ fine-tuning performance can also benefit from this pre-training paradigm, which is considerably less studied yet in contrast to the well-established lightweight architecture design methodology. We use an observation-analysis-solution flow for our study. We first systematically <b>observe</b> different behaviors among the evaluated pre-training methods with respect to the downstream fine-tuning data scales. Furthermore, we <b>analyze</b> the layer representation similarities and attention maps across the obtained models, which clearly show the inferior learning of MIM pre-training on higher layers, leading to unsatisfactory transfer performance on data-insufficient downstream tasks. This finding is naturally a guide to designing our distillation strategies during pre-training to <b>solve</b> the above deterioration problem. Extensive experiments have demonstrated the effectiveness of our approach. Our pre-training with distillation on pure lightweight ViTs with vanilla/hierarchical design (5.7<i>M</i>/6.5<i>M</i>) can achieve <span>(79.4%)</span>/<span>(78.9%)</span> top-1 accuracy on ImageNet-1K. It also enables SOTA performance on the ADE20K segmentation task (<span>(42.8%)</span> mIoU) and LaSOT tracking task (<span>(66.1%)</span> AUC) in the lightweight regime. The latter even surpasses all the current SOTA lightweight CPU-realtime trackers.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"3 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-02-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143401621","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
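As a rough illustration of the distillation-during-pre-training remedy described above, the sketch below aligns a lightweight student's higher-layer token features with a frozen teacher's via a cosine distance. The projection, loss choice, and dimensions are assumptions rather than the authors' recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureDistillLoss(nn.Module):
    """Align student tokens with teacher tokens after a learned projection."""
    def __init__(self, student_dim, teacher_dim):
        super().__init__()
        self.proj = nn.Linear(student_dim, teacher_dim)  # match feature dimensions

    def forward(self, student_tokens, teacher_tokens):
        s = F.normalize(self.proj(student_tokens), dim=-1)
        t = F.normalize(teacher_tokens, dim=-1)
        return (1 - (s * t).sum(dim=-1)).mean()          # mean cosine distance

# Toy usage with random "higher-layer" token features.
distill = FeatureDistillLoss(student_dim=192, teacher_dim=768)
student_feats = torch.randn(8, 196, 192)                 # lightweight ViT tokens
with torch.no_grad():
    teacher_feats = torch.randn(8, 196, 768)             # frozen large teacher tokens
loss = distill(student_feats, teacher_feats)             # added to the MIM reconstruction loss
```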
Smaller But Better: Unifying Layout Generation with Smaller Large Language Models
IF 19.5, Zone 2, Computer Science
International Journal of Computer Vision, Pub Date: 2025-02-12, DOI: 10.1007/s11263-025-02353-2
Peirong Zhang, Jiaxin Zhang, Jiahuan Cao, Hongliang Li, Lianwen Jin
{"title":"Smaller But Better: Unifying Layout Generation with Smaller Large Language Models","authors":"Peirong Zhang, Jiaxin Zhang, Jiahuan Cao, Hongliang Li, Lianwen Jin","doi":"10.1007/s11263-025-02353-2","DOIUrl":"https://doi.org/10.1007/s11263-025-02353-2","url":null,"abstract":"<p>We propose LGGPT, an LLM-based model tailored for unified layout generation. First, we propose Arbitrary Layout Instruction (ALI) and Universal Layout Response (ULR) as the uniform I/O template. ALI accommodates arbitrary layout generation task inputs across multiple layout domains, enabling LGGPT to unify both task-generic and domain-generic layout generation hitherto unexplored. Collectively, ALI and ULR boast a succinct structure that forgoes superfluous tokens typically found in existing HTML-based formats, facilitating efficient instruction tuning and boosting unified generation performance. In addition, we propose an Interval Quantization Encoding (IQE) strategy that compresses ALI into a more condensed structure. IQE precisely preserves valid layout clues while eliminating the less informative placeholders, facilitating LGGPT to capture complex and variable layout generation conditions during the unified training process. Experimental results demonstrate that LGGPT achieves superior or on par performance compared to existing methods. Notably, LGGPT strikes a prominent balance between proficiency and efficiency with a compact 1.5B parameter LLM, which beats prior 7B or 175B models even in the most extensive and challenging unified scenario. Furthermore, we underscore the necessity of employing LLMs for unified layout generation and suggest that 1.5B could be an optimal parameter size by comparing LLMs of varying scales. Code is available at https://github.com/NiceRingNode/LGGPT.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"77 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143393217","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
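The sketch below illustrates the general idea of interval-quantization-style layout encoding: coordinates are snapped to a small number of discrete bins and serialized without HTML-style markup. The bin count, token format, and element vocabulary here are invented for illustration and are not LGGPT's actual ALI/IQE specification.

```python
def quantize(value, num_bins=128):
    """Map a coordinate in [0, 1] to an integer bin index."""
    return min(int(value * num_bins), num_bins - 1)

def encode_layout(elements, num_bins=128):
    """elements: list of (category, x, y, w, h) with coordinates in [0, 1]."""
    tokens = []
    for cat, x, y, w, h in elements:
        coords = [quantize(v, num_bins) for v in (x, y, w, h)]
        tokens.append(f"{cat} " + " ".join(str(c) for c in coords))
    return " | ".join(tokens)

layout = [("title", 0.1, 0.05, 0.8, 0.1), ("image", 0.1, 0.2, 0.8, 0.5)]
print(encode_layout(layout))   # "title 12 6 102 12 | image 12 25 102 64"
```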
LiDAR-guided Geometric Pretraining for Vision-Centric 3D Object Detection
IF 19.5, Zone 2, Computer Science
International Journal of Computer Vision, Pub Date: 2025-02-09, DOI: 10.1007/s11263-025-02351-4
Linyan Huang, Huijie Wang, Jia Zeng, Shengchuan Zhang, Liujuan Cao, Junchi Yan, Hongyang Li
{"title":"LiDAR-guided Geometric Pretraining for Vision-Centric 3D Object Detection","authors":"Linyan Huang, Huijie Wang, Jia Zeng, Shengchuan Zhang, Liujuan Cao, Junchi Yan, Hongyang Li","doi":"10.1007/s11263-025-02351-4","DOIUrl":"https://doi.org/10.1007/s11263-025-02351-4","url":null,"abstract":"<p>Multi-camera 3D object detection for autonomous driving is a challenging problem that has garnered notable attention from both academia and industry. An obstacle encountered in vision-based techniques involves the precise extraction of geometry-conscious features from RGB images. Recent approaches have utilized geometric-aware image backbones pretrained on depth-relevant tasks to acquire spatial information. However, these approaches overlook the critical aspect of view transformation, resulting in inadequate performance due to the misalignment of spatial knowledge between the image backbone and view transformation. To address this issue, we propose a novel geometric-aware pretraining framework called <b>GAPretrain</b>. Our approach incorporates spatial and structural cues to camera networks by employing the geometric-rich modality as guidance during the pretraining phase. The transference of modal-specific attributes across different modalities is non-trivial, but we bridge this gap by using a unified bird’s-eye-view (BEV) representation and structural hints derived from LiDAR point clouds to facilitate the pretraining process. GAPretrain serves as a plug-and-play solution that can be flexibly applied to multiple state-of-the-art detectors. Our experiments demonstrate the effectiveness and generalization ability of the proposed method. We achieve 46.2 mAP and 55.5 NDS on the nuScenes <i>val</i> set using the BEVFormer method, with a gain of 2.7 and 2.1 points, respectively. We also conduct experiments on various image backbones and view transformations to validate the efficacy of our approach. Code will be released at https://github.com/OpenDriveLab/BEVPerception-Survey-Recipe.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"21 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-02-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143371605","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
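A minimal sketch of LiDAR-guided BEV pretraining in the spirit described above: LiDAR points are rasterized into a BEV occupancy target that supervises the camera branch's BEV features. The rasterization, occupancy head, and BCE loss are assumptions, not GAPretrain's actual objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def lidar_to_bev_occupancy(points_xyz, bev_size=128, x_range=(-50.0, 50.0), y_range=(-50.0, 50.0)):
    """Binary BEV occupancy map rasterized from LiDAR points."""
    occ = torch.zeros(bev_size, bev_size)
    xs = ((points_xyz[:, 0] - x_range[0]) / (x_range[1] - x_range[0]) * bev_size).long()
    ys = ((points_xyz[:, 1] - y_range[0]) / (y_range[1] - y_range[0]) * bev_size).long()
    valid = (xs >= 0) & (xs < bev_size) & (ys >= 0) & (ys < bev_size)
    occ[ys[valid], xs[valid]] = 1.0
    return occ

occupancy_head = nn.Conv2d(256, 1, kernel_size=1)         # camera BEV features -> occupancy logits
camera_bev = torch.randn(2, 256, 128, 128)                 # output of the camera view transform
lidar_pts = torch.rand(2, 20000, 3) * 100 - 50             # toy point clouds in [-50, 50) m
target = torch.stack([lidar_to_bev_occupancy(p) for p in lidar_pts])   # (2, 128, 128)
loss = F.binary_cross_entropy_with_logits(occupancy_head(camera_bev).squeeze(1), target)
```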
Learning Structure-Supporting Dependencies via Keypoint Interactive Transformer for General Mammal Pose Estimation
IF 19.5, Zone 2, Computer Science
International Journal of Computer Vision, Pub Date: 2025-02-07, DOI: 10.1007/s11263-025-02355-0
Tianyang Xu, Jiyong Rao, Xiaoning Song, Zhenhua Feng, Xiao-Jun Wu
{"title":"Learning Structure-Supporting Dependencies via Keypoint Interactive Transformer for General Mammal Pose Estimation","authors":"Tianyang Xu, Jiyong Rao, Xiaoning Song, Zhenhua Feng, Xiao-Jun Wu","doi":"10.1007/s11263-025-02355-0","DOIUrl":"https://doi.org/10.1007/s11263-025-02355-0","url":null,"abstract":"<p>General mammal pose estimation is an important and challenging task in computer vision, which is essential for understanding mammal behaviour in real-world applications. However, existing studies are at their preliminary research stage, which focus on addressing the problem for only a few specific mammal species. In principle, from specific to general mammal pose estimation, the biggest issue is how to address the huge appearance and pose variances for different species. We argue that given appearance context, instance-level prior and the structural relation among keypoints can serve as complementary evidence. To this end, we propose a Keypoint Interactive Transformer (KIT) to learn instance-level structure-supporting dependencies for general mammal pose estimation. Specifically, our KITPose consists of two coupled components. The first component is to extract keypoint features and generate body part prompts. The features are supervised by a dedicated generalised heatmap regression loss (GHRL). Instead of introducing external visual/text prompts, we devise keypoints clustering to generate body part biases, aligning them with image context to generate corresponding instance-level prompts. Second, we propose a novel interactive transformer that takes feature slices as input tokens without performing spatial splitting. In addition, to enhance the capability of the KIT model, we design an adaptive weight strategy to address the imbalance issue among different keypoints. Extensive experimental results obtained on the widely used animal datasets, AP10K and AnimalKingdom, demonstrate the superiority of the proposed method over the state-of-the-art approaches. It achieves 77.9 AP on the AP10K <i>val</i> set, outperforming HRFormer by 2.2. Besides, our KITPose can be directly transferred to human pose estimation with promising results, as evaluated on COCO, reflecting the merits of constructing structure-supporting architectures for general mammal pose estimation.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"14 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-02-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143258495","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
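As a toy illustration of adaptive per-keypoint weighting for heatmap regression (the imbalance issue mentioned above), the sketch below learns softmax-normalized weights over keypoints; it is an assumption for illustration, not the paper's GHRL or weighting scheme.

```python
import torch
import torch.nn as nn

class AdaptiveKeypointLoss(nn.Module):
    """Heatmap MSE with learnable, softmax-normalized per-keypoint weights."""
    def __init__(self, num_keypoints):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_keypoints))

    def forward(self, pred_heatmaps, gt_heatmaps):
        # per-keypoint mean squared error over batch and spatial dims, shape (K,)
        per_kpt = ((pred_heatmaps - gt_heatmaps) ** 2).mean(dim=(0, 2, 3))
        weights = torch.softmax(self.logits, dim=0) * per_kpt.numel()  # sums to K
        return (weights * per_kpt).mean()

criterion = AdaptiveKeypointLoss(num_keypoints=17)
pred = torch.randn(4, 17, 64, 48)
gt = torch.rand(4, 17, 64, 48)
loss = criterion(pred, gt)
```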
Towards Boosting Out-of-Distribution Detection from a Spatial Feature Importance Perspective
IF 19.5, Zone 2, Computer Science
International Journal of Computer Vision, Pub Date: 2025-02-05, DOI: 10.1007/s11263-025-02347-0
Yao Zhu, Xiu Yan, Chuanlong Xie
{"title":"Towards Boosting Out-of-Distribution Detection from a Spatial Feature Importance Perspective","authors":"Yao Zhu, Xiu Yan, Chuanlong Xie","doi":"10.1007/s11263-025-02347-0","DOIUrl":"https://doi.org/10.1007/s11263-025-02347-0","url":null,"abstract":"<p>In ensuring the reliable and secure operation of models, Out-of-Distribution (OOD) detection has gained widespread attention in recent years. Researchers have proposed various promising detection criteria to construct the rejection region of the model, treating samples falling into this region as out-of-distribution. However, these detection criteria are computed using all dense features of the model (before the pooling layer), overlooking the fact that different features may exhibit varying importance in the decision-making process. To this end, we first propose quantifying the contribution of different spatial positions in the dense features to the decision result with the Shapley Value, thereby obtaining the feature importance map. This spatial-oriented feature attribution method, compared to the classical channel-oriented feature attribution method that linearly combines weights with each activation map, achieves superior visual performance and fidelity in interpreting the decision-making process. Subsequently, we introduce a spatial feature purification method that removes spatial features with low importance in dense features and advocates using these purified features to compute detection criteria. Extensive experiments demonstrate the effectiveness of spatial feature purification in enhancing the performance of various existing detection methods. Notably, spatial feature purification boosts Energy Score and NNGuide on the ImageNet benchmark by 18.39<span>(%)</span> and 26.45<span>(%)</span> in average FPR95, respectively.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"9 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-02-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143125021","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
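A simplified sketch of spatial feature purification before an energy-based OOD score. Here each spatial position's importance is approximated by an occlusion-style perturbation of the max logit rather than the exact Shapley Value used in the paper, and low-importance positions are zeroed before pooling; all of this is an illustrative assumption.

```python
import torch

def energy_score(logits):
    return -torch.logsumexp(logits, dim=-1)             # lower = more in-distribution

def purified_energy(dense_feat, classifier, keep_ratio=0.7):
    """dense_feat: (C, H, W); classifier maps a pooled (C,) vector to class logits."""
    C, H, W = dense_feat.shape
    base = classifier(dense_feat.mean(dim=(1, 2))).max()
    importance = torch.zeros(H, W)
    for i in range(H):                                   # occlusion-style importance proxy
        for j in range(W):
            masked = dense_feat.clone()
            masked[:, i, j] = 0.0
            importance[i, j] = (base - classifier(masked.mean(dim=(1, 2))).max()).abs()
    keep = importance.flatten().topk(int(keep_ratio * H * W)).indices
    mask = torch.zeros(H * W)
    mask[keep] = 1.0
    purified = dense_feat * mask.view(1, H, W)           # drop low-importance positions
    pooled = purified.sum(dim=(1, 2)) / mask.sum()       # mean over kept positions only
    return energy_score(classifier(pooled))

classifier = torch.nn.Linear(512, 1000)
with torch.no_grad():
    score = purified_energy(torch.randn(512, 7, 7), classifier)
```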
Contrastive Decoupled Representation Learning and Regularization for Speech-Preserving Facial Expression Manipulation
IF 19.5, Zone 2, Computer Science
International Journal of Computer Vision, Pub Date: 2025-02-04, DOI: 10.1007/s11263-025-02358-x
Tianshui Chen, Jianman Lin, Zhijing Yang, Chumei Qing, Yukai Shi, Liang Lin
{"title":"Contrastive Decoupled Representation Learning and Regularization for Speech-Preserving Facial Expression Manipulation","authors":"Tianshui Chen, Jianman Lin, Zhijing Yang, Chumei Qing, Yukai Shi, Liang Lin","doi":"10.1007/s11263-025-02358-x","DOIUrl":"https://doi.org/10.1007/s11263-025-02358-x","url":null,"abstract":"<p>Speech-preserving facial expression manipulation (SPFEM) aims to modify a talking head to display a specific reference emotion while preserving the mouth animation of source spoken contents. Thus, emotion and content information existing in reference and source inputs can provide direct and accurate supervision signals for SPFEM models. However, the intrinsic intertwining of these elements during the talking process poses challenges to their effectiveness as supervisory signals. In this work, we propose to learn content and emotion priors as guidance augmented with contrastive learning to learn decoupled content and emotion representation via an innovative Contrastive Decoupled Representation Learning (CDRL) algorithm. Specifically, a Contrastive Content Representation Learning (CCRL) module is designed to learn audio feature, which primarily contains content information, as content priors to guide learning content representation from the source input. Meanwhile, a Contrastive Emotion Representation Learning (CERL) module is proposed to make use of a pre-trained visual-language model to learn emotion prior, which is then used to guide learning emotion representation from the reference input. We further introduce emotion-aware and emotion-augmented contrastive learning to train CCRL and CERL modules, respectively, ensuring learning emotion-independent content representation and content-independent emotion representation. During SPFEM model training, the decoupled content and emotion representations are used to supervise the generation process, ensuring more accurate emotion manipulation together with audio-lip synchronization. Extensive experiments and evaluations on various benchmarks show the effectiveness of the proposed algorithm.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"39 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143083462","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
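A generic InfoNCE-style sketch of how an audio-derived content prior could supervise a visual content representation, in the spirit of the CCRL module above. The encoders, projection dimensions, and temperature are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def info_nce(visual_content, audio_prior, temperature=0.07):
    """Both inputs: (B, D). Matching batch indices are treated as positive pairs."""
    v = F.normalize(visual_content, dim=-1)
    a = F.normalize(audio_prior, dim=-1)
    logits = v @ a.T / temperature                       # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    # symmetric contrastive loss: visual -> audio and audio -> visual
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

loss = info_nce(torch.randn(16, 256), torch.randn(16, 256))
```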
DiffuVolume: Diffusion Model for Volume based Stereo Matching
IF 19.5, Zone 2, Computer Science
International Journal of Computer Vision, Pub Date: 2025-02-01, DOI: 10.1007/s11263-025-02362-1
Dian Zheng, Xiao-Ming Wu, Zuhao Liu, Jingke Meng, Wei-Shi Zheng
{"title":"DiffuVolume: Diffusion Model for Volume based Stereo Matching","authors":"Dian Zheng, Xiao-Ming Wu, Zuhao Liu, Jingke Meng, Wei-Shi Zheng","doi":"10.1007/s11263-025-02362-1","DOIUrl":"https://doi.org/10.1007/s11263-025-02362-1","url":null,"abstract":"<p>Stereo matching is a significant part in many computer vision tasks and driving-based applications. Recently cost volume-based methods have achieved great success benefiting from the rich geometry information in paired images. However, the redundancy of cost volume also interferes with the model training and limits the performance. To construct a more precise cost volume, we pioneeringly apply the diffusion model to stereo matching. Our method, termed DiffuVolume, considers the diffusion model as a cost volume filter, which will recurrently remove the redundant information from the cost volume. Two main designs make our method not trivial. Firstly, to make the diffusion model more adaptive to stereo matching, we eschew the traditional manner of directly adding noise into the image but embed the diffusion model into a task-specific module. In this way, we outperform the traditional diffusion stereo matching method by 27<span>(%)</span> EPE improvement and 7 times parameters reduction. Secondly, DiffuVolume can be easily embedded into any volume-based stereo matching network, boosting performance with only a slight increase in parameters (approximately 2<span>(%)</span>). By adding the DiffuVolume into well-performed methods, we outperform all the published methods on Scene Flow, KITTI2012, KITTI2015 benchmarks and zero-shot generalization setting. It is worth mentioning that the proposed model ranks 1<i>st</i> on KITTI 2012 leader board, 2<i>nd</i> on KITTI 2015 leader board since 15, July 2023.\u0000</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"17 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143072451","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
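A hypothetical sketch of training a denoising network as a cost-volume filter: noise is added to a target volume and a small 3D CNN, conditioned on the raw cost volume, predicts that noise (a standard DDPM-style training step, with timestep conditioning omitted for brevity). The target definition, network, and noise schedule are assumptions, not DiffuVolume's design.

```python
import torch
import torch.nn as nn

class VolumeFilter(nn.Module):
    """Toy denoiser: noisy volume + raw cost volume (condition) -> predicted noise."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(2, 8, 3, padding=1), nn.ReLU(),
            nn.Conv3d(8, 1, 3, padding=1),
        )

    def forward(self, noisy_volume, raw_cost_volume):
        x = torch.stack([noisy_volume, raw_cost_volume], dim=1)   # (B, 2, D, H, W)
        return self.net(x).squeeze(1)                             # (B, D, H, W)

betas = torch.linspace(1e-4, 0.02, 1000)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def ddpm_training_step(model, target_volume, raw_cost_volume):
    """Standard noise-prediction objective on a (B, D, H, W) volume."""
    t = torch.randint(0, 1000, (target_volume.size(0),))
    a = alphas_cumprod[t].view(-1, 1, 1, 1)
    noise = torch.randn_like(target_volume)
    noisy = a.sqrt() * target_volume + (1 - a).sqrt() * noise
    return ((model(noisy, raw_cost_volume) - noise) ** 2).mean()

model = VolumeFilter()
loss = ddpm_training_step(model, torch.rand(2, 48, 32, 32), torch.rand(2, 48, 32, 32))
```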
TryOn-Adapter: Efficient Fine-Grained Clothing Identity Adaptation for High-Fidelity Virtual Try-On
IF 19.5, Zone 2, Computer Science
International Journal of Computer Vision, Pub Date: 2025-01-31, DOI: 10.1007/s11263-025-02352-3
Jiazheng Xing, Chao Xu, Yijie Qian, Yang Liu, Guang Dai, Baigui Sun, Yong Liu, Jingdong Wang
{"title":"TryOn-Adapter: Efficient Fine-Grained Clothing Identity Adaptation for High-Fidelity Virtual Try-On","authors":"Jiazheng Xing, Chao Xu, Yijie Qian, Yang Liu, Guang Dai, Baigui Sun, Yong Liu, Jingdong Wang","doi":"10.1007/s11263-025-02352-3","DOIUrl":"https://doi.org/10.1007/s11263-025-02352-3","url":null,"abstract":"<p>Virtual try-on focuses on adjusting the given clothes to fit a specific person seamlessly while avoiding any distortion of the patterns and textures of the garment. However, the clothing identity uncontrollability and training inefficiency of existing diffusion-based methods, which struggle to maintain the identity even with full parameter training, are significant limitations that hinder the widespread applications. In this work, we propose an effective and efficient framework, termed TryOn-Adapter. Specifically, we first decouple clothing identity into fine-grained factors: style for color and category information, texture for high-frequency details, and structure for smooth spatial adaptive transformation. Our approach utilizes a pre-trained exemplar-based diffusion model as the fundamental network, whose parameters are frozen except for the attention layers. We then customize three lightweight modules (Style Preserving, Texture Highlighting, and Structure Adapting) incorporated with fine-tuning techniques to enable precise and efficient identity control. Meanwhile, we introduce the training-free T-RePaint strategy to further enhance clothing identity preservation while maintaining the realistic try-on effect during the inference. Our experiments demonstrate that our approach achieves state-of-the-art performance on two widely-used benchmarks. Additionally, compared with recent full-tuning diffusion-based methods, we only use about half of their tunable parameters during training. The code will be made publicly available at https://github.com/jiazheng-xing/TryOn-Adapter.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"51 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-01-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143072450","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
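A minimal sketch of the parameter-efficient tuning recipe described above: freeze a pretrained backbone except its attention layers and any attached lightweight adapters, then count what remains trainable. The module naming convention ("attn", "adapter") and the toy backbone are assumptions for illustration.

```python
import torch.nn as nn

def freeze_except_attention_and_adapters(model: nn.Module):
    """Only attention layers and adapter modules stay trainable."""
    for name, param in model.named_parameters():
        param.requires_grad = ("attn" in name) or ("adapter" in name)

def count_trainable(model: nn.Module):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Toy backbone standing in for the frozen exemplar-based diffusion model.
backbone = nn.Sequential()
backbone.add_module("conv_in", nn.Conv2d(4, 64, 3, padding=1))
backbone.add_module("attn_block", nn.MultiheadAttention(64, 4, batch_first=True))
backbone.add_module("style_adapter", nn.Linear(64, 64))
freeze_except_attention_and_adapters(backbone)
print(count_trainable(backbone), "trainable parameters")
```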
Self-supervised Shutter Unrolling with Events
IF 19.5, Zone 2, Computer Science
International Journal of Computer Vision, Pub Date: 2025-01-29, DOI: 10.1007/s11263-025-02364-z
Mingyuan Lin, Yangguang Wang, Xiang Zhang, Boxin Shi, Wen Yang, Chu He, Gui-song Xia, Lei Yu
{"title":"Self-supervised Shutter Unrolling with Events","authors":"Mingyuan Lin, Yangguang Wang, Xiang Zhang, Boxin Shi, Wen Yang, Chu He, Gui-song Xia, Lei Yu","doi":"10.1007/s11263-025-02364-z","DOIUrl":"https://doi.org/10.1007/s11263-025-02364-z","url":null,"abstract":"<p>Continuous-time Global Shutter Video Recovery (CGVR) faces a substantial challenge in recovering undistorted high frame-rate Global Shutter (GS) videos from distorted Rolling Shutter (RS) images. This problem is severely ill-posed due to the absence of temporal dynamic information within RS intra-frame scanlines and inter-frame exposures, particularly when prior knowledge about camera/object motions is unavailable. Commonly used artificial assumptions on scenes/motions and data-specific characteristics are prone to producing sub-optimal solutions in real-world scenarios. To address this challenge, we propose an event-based CGVR network within a self-supervised learning paradigm, <i>i.e.</i>, SelfUnroll, and leverage the extremely high temporal resolution of event cameras to provide accurate inter/intra-frame dynamic information. Specifically, an Event-based Inter/intra-frame Compensator (E-IC) is proposed to predict the per-pixel dynamic between arbitrary time intervals, including the temporal transition and spatial translation. Exploring connections in terms of RS-RS, RS-GS, and GS-RS, we explicitly formulate mutual constraints with the proposed E-IC, resulting in supervisions without ground-truth GS images. Extensive evaluations over synthetic and real datasets demonstrate that the proposed method achieves state-of-the-art methods and shows remarkable performance for event-based RS2GS inversion in real-world scenarios. The dataset and code are available at https://w3un.github.io/selfunroll/.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"36 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-01-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143055414","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
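A heavily simplified skeleton of an RS-RS-style mutual constraint: a toy event-conditioned compensator predicts per-pixel displacement, one rolling-shutter frame is warped toward the other, and a photometric loss provides self-supervision. The compensator, warping, and loss here are illustrative assumptions, not the paper's E-IC.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def warp(image, flow):
    """Backward-warp image (B, C, H, W) with per-pixel flow (B, 2, H, W)."""
    B, _, H, W = image.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    base = torch.stack([xs, ys], dim=0).float().expand(B, -1, -1, -1)
    coords = base + flow
    grid = torch.stack([coords[:, 0] / (W - 1) * 2 - 1,
                        coords[:, 1] / (H - 1) * 2 - 1], dim=-1)   # (B, H, W, 2)
    return F.grid_sample(image, grid, align_corners=True)

compensator = nn.Conv2d(2 + 3, 2, 3, padding=1)       # toy stand-in: events + frame -> flow
rs_frame_0 = torch.rand(1, 3, 64, 64)
rs_frame_1 = torch.rand(1, 3, 64, 64)
event_voxel = torch.rand(1, 2, 64, 64)                # stand-in event representation
flow_0_to_1 = compensator(torch.cat([event_voxel, rs_frame_0], dim=1))
rs_rs_loss = F.l1_loss(warp(rs_frame_1, flow_0_to_1), rs_frame_0)   # RS-RS photometric constraint
```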