{"title":"Deep Depth from Focal Stack with Defocus Model for Camera-Setting Invariance","authors":"Yuki Fujimura, Masaaki Iiyama, Takuya Funatomi, Yasuhiro Mukaigawa","doi":"10.1007/s11263-023-01964-x","DOIUrl":"https://doi.org/10.1007/s11263-023-01964-x","url":null,"abstract":"<p>We propose deep depth from focal stack (DDFS), which takes a focal stack as input of a neural network for estimating scene depth. Defocus blur is a useful cue for depth estimation. However, the size of the blur depends on not only scene depth but also camera settings such as focus distance, focal length, and f-number. Current learning-based methods without any defocus models cannot estimate a correct depth map if camera settings are different at training and test times. Our method takes a plane sweep volume as input for the constraint between scene depth, defocus images, and camera settings, and this intermediate representation enables depth estimation with different camera settings at training and test times. This camera-setting invariance can enhance the applicability of DDFS. The experimental results also indicate that our method is robust against a synthetic-to-real domain gap.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"10 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2023-12-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139059671","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Grounded Affordance from Exocentric View","authors":"Hongchen Luo, Wei Zhai, Jing Zhang, Yang Cao, Dacheng Tao","doi":"10.1007/s11263-023-01962-z","DOIUrl":"https://doi.org/10.1007/s11263-023-01962-z","url":null,"abstract":"<p>Affordance grounding aims to locate objects’ “action possibilities” regions, an essential step toward embodied intelligence. Due to the diversity of interactive affordance, <i>i.e.</i>, the uniqueness of different individual habits leads to diverse interactions, which makes it difficult to establish an explicit link between object parts and affordance labels. Human has the ability that transforms various exocentric interactions into invariant egocentric affordance to counter the impact of interactive diversity. To empower an agent with such ability, this paper proposes a task of affordance grounding from the exocentric view, <i>i.e.</i>, given exocentric human-object interaction and egocentric object images, learning the affordance knowledge of the object and transferring it to the egocentric image using only the affordance label as supervision. However, there is some “interaction bias” between personas, mainly regarding different regions and views. To this end, we devise a cross-view affordance knowledge transfer framework that extracts affordance-specific features from exocentric interactions and transfers them to the egocentric view to solve the above problems. Furthermore, the perception of affordance regions is enhanced by preserving affordance co-relations. In addition, an affordance grounding dataset named AGD20K is constructed by collecting and labeling over 20K images from 36 affordance categories. Experimental results demonstrate that our method outperforms the representative models regarding objective metrics and visual quality. The code is available via: github.com/lhc1224/Cross-View-AG.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"77 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2023-12-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139041466","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Infrared Adversarial Patches with Learnable Shapes and Locations in the Physical World","authors":"Xingxing Wei, Jie Yu, Yao Huang","doi":"10.1007/s11263-023-01963-y","DOIUrl":"https://doi.org/10.1007/s11263-023-01963-y","url":null,"abstract":"<p>Owing to the extensive application of infrared object detectors in the safety-critical tasks, it is necessary to evaluate their robustness against adversarial examples in the real world. However, current few physical infrared attacks are complicated to implement in practical application because of their complex transformation from the digital world to physical world. To address this issue, in this paper, we propose a physically feasible infrared attack method called “infrared adversarial patches”. Considering the imaging mechanism of infrared cameras by capturing objects’ thermal radiation, infrared adversarial patches conduct attacks by attaching a patch of thermal insulation materials on the target object to manipulate its thermal distribution. To enhance adversarial attacks, we present a novel aggregation regularization to guide the simultaneous learning for the patch’s shape and location on the target object. Thus, a simple gradient-based optimization can be adapted to solve for them. We verify infrared adversarial patches in different object detection tasks with various object detectors. Experimental results show that our method achieves more than 90% Attack Success Rate (ASR) versus the pedestrian detector and vehicle detector in the physical environment, where the objects are captured in different angles, distances, postures, and scenes. More importantly, infrared adversarial patch is easy to implement, and it only needs 0.5 h to be manufactured in the physical world, which verifies its effectiveness and efficiency. Another advantage of our infrared adversarial patches is the ability to extend to attack the visible object detector in the physical world. As a consequence, we can simultaneously perform the infrared and visible physical attacks by a unified adversarial patch, which shows the good generalization.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"31 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2023-12-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138887330","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Convex–Concave Tensor Robust Principal Component Analysis","authors":"","doi":"10.1007/s11263-023-01960-1","DOIUrl":"https://doi.org/10.1007/s11263-023-01960-1","url":null,"abstract":"<h3>Abstract</h3> <p>Tensor robust principal component analysis (TRPCA) aims at recovering the underlying low-rank clean tensor and residual sparse component from the observed tensor. The recovery quality heavily depends on the definition of tensor rank which has diverse construction schemes. Recently, tensor average rank has been proposed and the tensor nuclear norm has been proven to be its best convex surrogate. Many improved works based on the tensor nuclear norm have emerged rapidly. Nevertheless, there exist three common drawbacks: (1) the neglect of consideration on relativity between the distribution of large singular values and low-rank constraint; (2) the prior assumption of equal treatment for frontal slices hidden in tensor nuclear norm; (3) the missing convergence of whole iteration sequences in optimization. To address these problems together, in this paper, we propose a convex–concave TRPCA method in which the notion of convex–convex singular value separation (CCSVS) plays a dominant role in the objective. It can adjust the distribution of the first several largest singular values with low-rank controlling in a relative way and emphasize the importance of frontal slices collaboratively. Remarkably, we provide the rigorous convergence analysis of whole iteration sequences in optimization. Besides, a low-rank tensor recovery guarantee is established for the proposed CCSVS model. Extensive experiments demonstrate that the proposed CCSVS significantly outperforms state-of-the-art methods over toy data and real-world datasets, and running time per image is also the fastest.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"35 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2023-12-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138822838","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HyperSTAR: Task-Aware Hyperparameter Recommendation for Training and Compression","authors":"Chang Liu, Gaurav Mittal, Nikolaos Karianakis, Victor Fragoso, Ye Yu, Yun Fu, Mei Chen","doi":"10.1007/s11263-023-01961-0","DOIUrl":"https://doi.org/10.1007/s11263-023-01961-0","url":null,"abstract":"<p>Hyperparameter optimization (HPO) methods alleviate the significant effort required to obtain hyperparameters that perform optimally on visual learning problems. Existing methods are computationally inefficient because they are task agnostic (i.e., they do not adapt to a given task). We present HyperSTAR (System for Task Aware Hyperparameter Recommendation), a task-aware HPO algorithm that improves HPO efficiency for a target dataset by using prior knowledge from previous hyperparameter searches to recommend effective hyperparameters conditioned on the target dataset. HyperSTAR ranks and recommends hyperparameters by predicting their performance on the target dataset. To do so, it learns a joint dataset-hyperparameter space in an end-to-end manner that enables its performance predictor to use previously found effective hyperparameters for other similar tasks. The hyperparameter recommendations of HyperSTAR combined with existing HPO techniques lead to a task-aware HPO system that reduces the time to find the optimal hyperparameters for the target learning problem. Our experiments on image classification, object detection, and model pruning validate that HyperSTAR reduces the evaluation of different hyperparameter configurations by about <span>(50%)</span> compared to existing methods and, when combined with Hyperband, uses only <span>(25%)</span> of the budget required by the vanilla Hyperband and Bayesian Optimized Hyperband to achieve the best performance.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"52 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2023-12-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138840415","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MS-RAFT+: High Resolution Multi-Scale RAFT","authors":"Azin Jahedi, Maximilian Luz, Marc Rivinius, Lukas Mehl, Andrés Bruhn","doi":"10.1007/s11263-023-01930-7","DOIUrl":"https://doi.org/10.1007/s11263-023-01930-7","url":null,"abstract":"<p>Hierarchical concepts have proven useful in many classical and learning-based optical flow methods regarding both accuracy and robustness. In this paper we show that such concepts are still useful in the context of recent neural networks that follow RAFT’s paradigm refraining from hierarchical strategies by relying on recurrent updates based on a single-scale all-pairs transform. To this end, we introduce MS-RAFT+: a novel recurrent multi-scale architecture based on RAFT that unifies several successful hierarchical concepts. It employs a coarse-to-fine estimation to enable the use of finer resolutions by useful initializations from coarser scales. Moreover, it relies on RAFT’s correlation pyramid that allows to consider non-local cost information during the matching process. Furthermore, it makes use of advanced multi-scale features that incorporate high-level information from coarser scales. And finally, our method is trained subject to a sample-wise robust multi-scale multi-iteration loss that closely supervises each iteration on each scale, while allowing to discard particularly difficult samples. In combination with an appropriate mixed-dataset training strategy, our method performs favorably. It not only yields highly accurate results on the four major benchmarks (KITTI 2015, MPI Sintel, Middlebury and VIPER), it also allows to achieve these results with a single model and a single parameter setting. Our trained model and code are available at https://github.com/cv-stuttgart/MS_RAFT_plus.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"90 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2023-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138713886","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Universal Event-Based Plug-In Module for Visual Object Tracking in Degraded Conditions","authors":"","doi":"10.1007/s11263-023-01959-8","DOIUrl":"https://doi.org/10.1007/s11263-023-01959-8","url":null,"abstract":"<h3>Abstract</h3> <p>Most existing trackers based on RGB/grayscale frames may collapse due to the unreliability of conventional sensors in some challenging scenarios (e.g., motion blur and high dynamic range). Event-based cameras as bioinspired sensors encode brightness changes with high temporal resolution and high dynamic range, thereby providing considerable potential for tracking under degraded conditions. Nevertheless, events lack the fine-grained texture cues provided by RGB/grayscale frames. This complementarity encourages us to fuse visual cues from the frame and event domains for robust object tracking under various challenging conditions. In this paper, we propose a novel event feature extractor to capture spatiotemporal features with motion cues from event-based data by boosting interactions and distinguishing alterations between states at different moments. Furthermore, we develop an effective feature integrator to adaptively fuse the strengths of both domains by balancing their contributions. Our proposed module as the plug-in can be easily applied to off-the-shelf frame-based trackers. We extensively validate the effectiveness of eight trackers extended by our approach on three datasets: EED, VisEvent, and our collected frame-event-based dataset FE141. Experimental results also show that event-based data is a powerful cue for tracking.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"27 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2023-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138740579","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fast Ultra High-Definition Video Deblurring via Multi-scale Separable Network","authors":"Wenqi Ren, Senyou Deng, Kaihao Zhang, Fenglong Song, Xiaochun Cao, Ming-Hsuan Yang","doi":"10.1007/s11263-023-01958-9","DOIUrl":"https://doi.org/10.1007/s11263-023-01958-9","url":null,"abstract":"<p>Despite significant progress has been made in image and video deblurring, much less attention has been paid to process ultra high-definition (UHD) videos (e.g., 4K resolution). In this work, we propose a novel deep model for fast and accurate UHD video deblurring (UHDVD). The proposed UHDVD is achieved by a depth-wise separable-patch architecture, which operates with a multi-scale integration scheme to achieve a large receptive field without adding the number of generic convolutional layers and kernels. Additionally, we adopt the temporal feature attention module to effectively exploit the temporal correlation between video frames to obtain clearer recovered images. We design an asymmetrical encoder–decoder architecture with residual channel-spatial attention blocks to improve accuracy and reduce the depth of the network appropriately. Consequently, the proposed UHDVD achieves real-time performance on 4K videos at 30 fps. To train the proposed model, we build a new dataset comprised of 4K blurry videos and corresponding sharp frames using three different smartphones. Extensive experimental results show that our network performs favorably against the state-of-the-art methods on the proposed 4K dataset and existing 720p and 2K benchmarks in terms of accuracy, speed, and model size.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"90 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2023-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138571253","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Learning Dynamic Prototypes for Visual Pattern Debiasing","authors":"Kongming Liang, Zijin Yin, Min Min, Yan Liu, Zhanyu Ma, Jun Guo","doi":"10.1007/s11263-023-01956-x","DOIUrl":"https://doi.org/10.1007/s11263-023-01956-x","url":null,"abstract":"<p>Deep learning has achieved great success in academic benchmarks but fails to work effectively in the real world due to the potential dataset bias. The current learning methods are prone to inheriting or even amplifying the bias present in a training dataset and under-represent specific demographic groups. More recently, some dataset debiasing methods have been developed to address the above challenges based on the awareness of protected or sensitive attribute labels. However, the number of protected or sensitive attributes may be considerably large, making it laborious and costly to acquire sufficient manual annotation. To this end, we propose a prototype-based network to dynamically balance the learning of different subgroups for a given dataset. First, an object pattern embedding mechanism is presented to make the network focus on the foreground region. Then we design a prototype learning method to discover and extract the visual patterns from the training data in an unsupervised way. The number of prototypes is dynamic depending on the pattern structure of the feature space. We evaluate the proposed prototype-based network on three widely used polyp segmentation datasets with abundant qualitative and quantitative experiments. Experimental results show that our proposed method outperforms the CNN-based and transformer-based state-of-the-art methods in terms of both effectiveness and fairness metrics. Moreover, extensive ablation studies are conducted to show the effectiveness of each proposed component and various parameter values. Lastly, we analyze how the number of prototypes grows during the training process and visualize the associated subgroups for each learned prototype. The code and data will be released at https://github.com/zijinY/dynamic-prototype-debiasing.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"48 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2023-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138559328","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Correction: SOTVerse: A User-Defined Task Space of Single Object Tracking","authors":"Shiyu Hu, Xin Zhao, Kaiqi Huang","doi":"10.1007/s11263-023-01968-7","DOIUrl":"https://doi.org/10.1007/s11263-023-01968-7","url":null,"abstract":"","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"2 10","pages":""},"PeriodicalIF":19.5,"publicationDate":"2023-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138586247","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}