{"title":"DI-Retinex: Digital-Imaging Retinex Model for Low-Light Image Enhancement","authors":"Shangquan Sun, Wenqi Ren, Jingyang Peng, Fenglong Song, Xiaochun Cao","doi":"10.1007/s11263-025-02542-z","DOIUrl":"https://doi.org/10.1007/s11263-025-02542-z","url":null,"abstract":"<p>Many existing methods for low-light image enhancement (LLIE) based on Retinex model ignore important factors that affect the validity of this model in digital imaging, such as noise, quantization error, non-linearity, and dynamic range overflow. In this paper, we propose a new expression called Digital-Imaging Retinex model (DI-Retinex) through theoretical and experimental analysis of Retinex model in digital imaging. Our new expression includes an offset term in the enhancement model, which allows for pixel-wise brightness contrast adjustment with a non-linear mapping function. In addition, to solve the low-light enhancement problem in an unsupervised manner, we propose an image-adaptive masked degradation loss in Gamma space. We also design a variance suppression loss for regulating the additional offset term. Extensive experiments show that our proposed method outperforms all existing unsupervised methods in terms of visual quality, model size, and speed. Our algorithm can also assist downstream face detectors in low-light, as it shows the most performance gain after the low-light enhancement compared to other methods. We have released our code and model weights on https://github.com/sunshangquan/Di-Retinex.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"42 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145007129","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Language-guided Recursive Spatiotemporal Graph Modeling for Video Summarization","authors":"Jungin Park, Jiyoung Lee, Kwanghoon Sohn","doi":"10.1007/s11263-025-02577-2","DOIUrl":"https://doi.org/10.1007/s11263-025-02577-2","url":null,"abstract":"<p>Video summarization aims to select keyframes that are visually diverse and can represent the whole story of a given video. Previous approaches have focused on global interlinkability between frames in a video by temporal modeling. However, fine-grained visual entities, such as objects, are also highly related to the main content of the video. Moreover, language-guided video summarization, which has recently been studied, requires a comprehensive linguistic understanding of complex real-world videos. To consider how all the objects are semantically related to each other, this paper regards video summarization as a language-guided spatiotemporal graph modeling problem. We present recursive spatiotemporal graph networks, called <i>VideoGraph</i>, which formulate the objects and frames as nodes of the spatial and temporal graphs, respectively. The nodes in each graph are connected and aggregated with graph edges, representing the semantic relationships between the nodes. To prevent the edges from being configured with visual similarity, we incorporate language queries derived from the video into the graph node representations, enabling them to contain semantic knowledge. In addition, we adopt a recursive strategy to refine initial graphs and correctly classify each frame node as a keyframe. In our experiments, VideoGraph achieves state-of-the-art performance on several benchmarks for generic and query-focused video summarization in both supervised and unsupervised manners. The code is available at https://github.com/park-jungin/videograph.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"8 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144995790","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parameterized Low-Rank Regularizer for High-dimensional Visual Data","authors":"Shuang Xu, Zixiang Zhao, Xiangyong Cao, Jiangjun Peng, Xi-Le Zhao, Deyu Meng, Yulun Zhang, Radu Timofte, Luc Van Gool","doi":"10.1007/s11263-025-02569-2","DOIUrl":"https://doi.org/10.1007/s11263-025-02569-2","url":null,"abstract":"<p>Factorization models and nuclear norms, two prominent methods for characterizing the low-rank prior, encounter challenges in accurately retrieving low-rank data under severe degradation and lack generalization capabilities. To mitigate these limitations, we propose a Parameterized Low-Rank Regularizer (PLRR), which models low-rank visual data through matrix factorization by utilizing neural networks to parameterize the factor matrices, whose feasible domains are essentially constrained. This approach can be interpreted as imposing an automatically learned penalty on factor matrices. More significantly, the knowledge encoded in network parameters enhances generalization. As a versatile low-rank modeling tool, PLRR exhibits superior performance in various inverse problems, including video foreground extraction, hyperspectral image (HSI) denoising, HSI inpainting, multi-temporal multispectral image (MSI) decloud, and MSI guided blind HSI super-resolution. More significantly, PLRR demonstrates robust generalization capabilities for images with diverse degradations, temporal variations, and scene contexts.\u0000</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"22 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144995789","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"EdgeSAM: Prompt-In-the-Loop Distillation for SAM","authors":"Chong Zhou, Xiangtai Li, Chen Change Loy, Bo Dai","doi":"10.1007/s11263-025-02562-9","DOIUrl":"https://doi.org/10.1007/s11263-025-02562-9","url":null,"abstract":"<p>This paper presents EdgeSAM, an accelerated variant of the Segment Anything Model (SAM), optimized for efficient execution on edge devices with minimal compromise in performance. Our approach involves distilling the original ViT-based SAM image encoder into a purely CNN-based architecture, better suited for edge devices. We carefully benchmark various distillation strategies and demonstrate that task-agnostic encoder distillation fails to capture the full knowledge embodied in SAM.To overcome this bottleneck, we include both the prompt encoder and mask decoder in the distillation process, with box and point prompts in the loop, so that the distilled model can accurately capture the intricate dynamics between user input and mask generation. To mitigate dataset bias issues stemming from point prompt distillation, we incorporate a lightweight module within the encoder.As a result, EdgeSAM achieves a 37-fold speed increase compared to the original SAM, and it also outperforms MobileSAM/EfficientSAM, being over 7 times as fast when deployed on edge devices while enhancing the mIoUs on COCO and LVIS by 2.3/1.5 and 3.1/1.6, respectively. It is also the first SAM variant that can run at over 30 FPS on an iPhone 14. Code and demo are available here https://mmlab-ntu.github.io/project/edgesam/.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"28 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144930142","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Curvature Learning for Generalization of Hyperbolic Neural Networks","authors":"Xiaomeng Fan, Yuwei Wu, Zhi Gao, Mehrtash Harandi, Yunde Jia","doi":"10.1007/s11263-025-02567-4","DOIUrl":"https://doi.org/10.1007/s11263-025-02567-4","url":null,"abstract":"<p>Hyperbolic neural networks (HNNs) have demonstrated notable efficacy in representing real-world data with hierarchical structures via exploiting the geometric properties of hyperbolic spaces characterized by negative curvatures. Curvature plays a crucial role in optimizing HNNs. Inappropriate curvatures may cause HNNs to converge to suboptimal parameters, degrading overall performance. So far, the theoretical foundation of the effect of curvatures on HNNs has not been developed. In this paper, we derive a PAC-Bayesian generalization bound of HNNs, highlighting the role of curvatures in the generalization of HNNs via their effect on the smoothness of the loss landscape. Driven by the derived bound, we propose a sharpness-aware curvature learning method to smooth the loss landscape, thereby improving the generalization of HNNs. In our method, we design a scope sharpness measure for curvatures, which is minimized through a bi-level optimization process. Then, we introduce an implicit differentiation algorithm that efficiently solves the bi-level optimization by approximating gradients of curvatures. We present the approximation error and convergence analyses of the proposed method, showing that the approximation error is upper-bounded, and the proposed method can converge by bounding gradients of HNNs. Experiments on four settings: classification, learning from long-tailed data, learning from noisy data, and few-shot learning show that our method can improve the performance of HNNs.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"16 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144930140","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Predictive Display for Teleoperation Based on Vector Fields Using Lidar-Camera Fusion","authors":"Gaurav Sharma, Jeff Calder, Rajesh Rajamani","doi":"10.1007/s11263-025-02550-z","DOIUrl":"https://doi.org/10.1007/s11263-025-02550-z","url":null,"abstract":"<p>Teleoperation can enable human intervention to help handle instances of failure in autonomy thus allowing for much safer deployment of autonomous vehicle technology. Successful teleoperation requires recreating the environment around the remote vehicle using camera data received over wireless communication channels. This paper develops a new predictive display system to tackle the significant time delays encountered in receiving camera data over wireless networks. First, a new high gain observer is developed for estimating the position and orientation of the ego vehicle. The novel observer is shown to perform accurate state estimation using only GNSS and gyroscope sensor readings. A vector field method which fuses the delayed camera and Lidar data is then presented. This method uses sparse 3D points obtained from Lidar and transforms them using the state estimates from the high gain observer to generate a sparse vector field for the camera image. Polynomial based interpolation is then performed to obtain the vector field for the complete image which is then remapped to synthesize images for accurate predictive display. The method is evaluated on real-world experimental data from the nuScenes and KITTI datasets. The performance of the high gain observer is also evaluated and compared with that of the EKF. The synthesized images using the vector field based predictive display are compared with ground truth images using various image metrics and offer vastly improved performance compared to delayed images.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"162 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-08-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144930139","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Visual Instruction Tuning towards General-Purpose Multimodal Large Language Model: A Survey","authors":"Jiaxing Huang, Jingyi Zhang, Kai Jiang, Han Qiu, Xiaoqin Zhang, Ling Shao, Shijian Lu, Dacheng Tao","doi":"10.1007/s11263-025-02572-7","DOIUrl":"https://doi.org/10.1007/s11263-025-02572-7","url":null,"abstract":"<p>Traditional computer vision generally solves each single task independently by a specialist model with the task instruction implicitly considered and designed in the model architecture. This simply leads to two constraints in: (1) task-specific models where each model is trained for one specific task, hindering its scalability and synergy across diverse tasks; (2) pre-defined and fixed model interfaces that have limited interactivity and adaptability in following user’s task instructions. Visual Instruction Tuning (VIT), which learns from a wide range of vision tasks as described by natural language instructions, has recently been intensively studied to mitigate the constraints of specialist models. It fine-tunes a large vision model with natural language as general task instructions, aiming for a general-purpose multimodal large language model (MLLM) that can follow various language instructions and potentially solve various user-specified vision tasks. This work aims to provide a systematic and comprehensive review of visual instruction tuning that covers six key aspects including: (1) the background of vision task paradigm and its development towards VIT; (2) the foundations of VIT including commonly used network architectures, visual instruction tuning frameworks and objectives, as well as evaluation setups and tasks; (3) widely adopted benchmarks in visual instruction tuning and evaluations; (4) a thorough review of existing VIT techniques as categorized by both vision tasks and method designs, highlighting their major contributions, strengths, as well as constraints; (5) comparison and discussion of VIT methods over various instruction-following benchmarks; (6) challenges, possible research directions and research topics in the future visual instruction tuning study. A project associated with this work has been created at [link].</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"24 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-08-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144930138","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SMPL-IKS: A Mixed Analytical-Neural Inverse Kinematics Solver for 3D Human Mesh Recovery","authors":"Zijian Zhang, Muqing Wu, Honghao Qi, Tianyi Ma, Min Zhao","doi":"10.1007/s11263-025-02574-5","DOIUrl":"https://doi.org/10.1007/s11263-025-02574-5","url":null,"abstract":"<p>We present SMPL-IKS, a mixed analytical-neural inverse kinematics solver that operates on the well-known Skinned Multi-Person Linear model (SMPL) to recover human mesh from 3D skeleton. The key challenges in the task are threefold: (1) Shape Mismatching, (2) Error Accumulation, and (3) Rotation Ambiguity. Unlike previous methods that rely on costly vertex up-sampling or iterative optimization, SMPL-IKS directly regresses the SMPL parameters (<i>i.e.</i>, shape and pose parameters) in a clean and efficient way. Specifically, we propose to infer <i>skeleton-to-mesh</i> via three explicit mappings viz. <i>Shape Inverse (SI)</i>, <i>Inverse kinematics (IK)</i>, and <i>Pose Refinement (PR)</i>. SI maps bone length to shape parameters, IK maps bone direction to pose parameters, and PR addresses errors accumulated along the kinematic tree. SMPL-IKS is general and thus extensible to MANO or SMPL-H models. Extensive experiments are conducted on various benchmarks for body-only, hand-only, and body-hand scenarios. Our model surpasses state-of-the-art methods by a large margin while being much more efficient. Data and code are available at https://github.com/Z-Z-J/SMPL-IKS.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"31 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-08-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144930259","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Flexible Camera Calibration using a Collimator System","authors":"Shunkun Liang, Banglei Guan, Zhenbao Yu, Dongcai Tan, Pengju Sun, Zibin Liu, Qifeng Yu, Yang Shang","doi":"10.1007/s11263-025-02576-3","DOIUrl":"https://doi.org/10.1007/s11263-025-02576-3","url":null,"abstract":"<p>Camera calibration is a crucial step in photogrammetry and 3D vision applications. This paper introduces a novel camera calibration method using a designed collimator system. Our collimator system provides a reliable and controllable calibration environment for the camera. Exploiting the unique optical geometry property of our collimator system, we introduce an angle invariance constraint and further prove that the relative motion between the calibration target and camera conforms to a spherical motion model. This constraint reduces the original 6DOF relative motion between target and camera to a 3DOF pure rotation motion. Using spherical motion constraint, a closed-form linear solver for multiple images and a minimal solver for two images are proposed for camera calibration. Furthermore, we propose a single collimator image calibration algorithm based on the angle invariance constraint. This algorithm eliminates the requirement for camera motion, providing a novel solution for flexible and fast calibration. The performance of our method is evaluated in both synthetic and real-world experiments, which verify the feasibility of calibration using the collimator system and demonstrate that our method is superior to existing baseline methods. Demo code is available at https://github.com/LiangSK98/CollimatorCalibration.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"32 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144930137","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Depth from Coupled Optical Differentiation","authors":"Junjie Luo, Yuxuan Liu, Emma Alexander, Qi Guo","doi":"10.1007/s11263-025-02534-z","DOIUrl":"https://doi.org/10.1007/s11263-025-02534-z","url":null,"abstract":"<p>We propose depth from coupled optical differentiation, a low-computation passive-lighting 3D sensing mechanism. It is based on our discovery that per-pixel object distance can be rigorously determined by a coupled pair of optical derivatives of a defocused image using a simple, closed-form relationship. Unlike previous depth-from-defocus (DfD) methods that leverage higher-order spatial derivatives of the image to estimate scene depths, the proposed mechanism’s use of only first-order optical derivatives makes it significantly more robust to noise. Furthermore, unlike many previous DfD algorithms with requirements on aperture code, this relationship is proved to be universal to a broad range of aperture codes. We build the first 3D sensor based on depth from coupled optical differentiation. Its optical assembly includes a deformable lens and a motorized iris, which enables dynamic adjustments to the optical power and aperture radius. The sensor captures two pairs of images: one pair with a differential change of optical power and the other with a differential change of aperture scale. From the four images, a depth and confidence map can be generated with only 36 floating point operations per output pixel (FLOPOP), more than ten times lower than the previous lowest passive-lighting depth sensing solution to our knowledge. Additionally, the depth map generated by the proposed sensor demonstrates more than twice the working range of previous DfD methods while using significantly lower computation.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"116 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144930210","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}