{"title":"Robust Unpaired Image Dehazing via Density and Depth Decomposition","authors":"Yang Yang, Chaoyue Wang, Xiaojie Guo, Dacheng Tao","doi":"10.1007/s11263-023-01940-5","DOIUrl":"https://doi.org/10.1007/s11263-023-01940-5","url":null,"abstract":"<p>To overcome the overfitting issue of dehazing models trained on synthetic hazy-clean image pairs, recent methods attempt to boost the generalization ability by training on unpaired data. However, most of existing approaches simply resort to formulating dehazing–rehazing cycles with generative adversarial networks, yet ignore the physical property in the real-world hazy environment, i.e., the haze effect varies along with density and depth. This paper proposes a robust self-augmented image dehazing framework for haze generation and removal. Instead of merely estimating transmission maps or clean content, the proposed scheme focuses on exploring the scattering coefficient and depth information of hazy and clean images. Having the scene depth estimated, our method is capable of re-rendering hazy images with different thicknesses, which benefits the training of the dehazing network. Besides, a dual contrastive perceptual loss is introduced to further improve the quality of both dehazed and rehazed images. Comprehensive experiments are conducted to reveal the advance of our method over other state-of-the-art unpaired dehazing methods in terms of visual quality, model size, and computational cost. Moreover, our model can be robustly trained on, not only synthetic indoor datasets, but also real outdoor scenes with remarkable improvement on the real-world image dehazing. Our code and training data are available at: https://github.com/YaN9-Y/D4_plus.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"80 20","pages":""},"PeriodicalIF":19.5,"publicationDate":"2023-11-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138437631","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CAE-GReaT: Convolutional-Auxiliary Efficient Graph Reasoning Transformer for Dense Image Predictions","authors":"Dong Zhang, Yi Lin, Jinhui Tang, Kwang-Ting Cheng","doi":"10.1007/s11263-023-01928-1","DOIUrl":"https://doi.org/10.1007/s11263-023-01928-1","url":null,"abstract":"<p>Convolutional Neural Networks (CNNs) and Vision Transformer (ViT) are two primary frameworks for current semantic image recognition tasks in the community of computer vision. The general consensus is that both CNNs and ViT have their latent strengths and weaknesses, e.g., CNNs are good at extracting local features but difficult to aggregate long-range feature dependencies, while ViT is good at aggregating long-range feature dependencies but poorly represents in local features. In this paper, we propose an auxiliary and integrated network architecture, named Convolutional-Auxiliary Efficient Graph Reasoning Transformer (CAE-GReaT), which joints strengths of both CNNs and ViT into a uniform framework. CAE-GReaT stands on the shoulders of the advanced graph reasoning transformer and employs an internal auxiliary convolutional branch to enrich the local feature representations. Besides, to reduce the computational costs in graph reasoning, we also propose an efficient information diffusion strategy. Compared to the existing ViT models, CAE-GReaT not only has the advantage of a purposeful interaction pattern (<i>via the graph reasoning branch</i>), but also can capture fine-grained heterogeneous feature representations (<i>via the auxiliary convolutional branch</i>). Extensive experiments are implemented on three challenging dense image prediction tasks, i.e., semantic segmentation, instance segmentation, and panoptic segmentation. Results demonstrate that CAE-GReaT can achieve consistent performance gains on the state-of-the-art baselines with a slightly computational cost.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"83 21","pages":""},"PeriodicalIF":19.5,"publicationDate":"2023-11-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138437610","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Mimic before Reconstruct: Enhancing Masked Autoencoders with Feature Mimicking","authors":"Peng Gao, Ziyi Lin, Renrui Zhang, Rongyao Fang, Hongyang Li, Hongsheng Li, Yu Qiao","doi":"10.1007/s11263-023-01898-4","DOIUrl":"https://doi.org/10.1007/s11263-023-01898-4","url":null,"abstract":"<p>Masked Autoencoders (MAE) have been popular paradigms for large-scale vision representation pre-training. However, MAE solely reconstructs the low-level RGB signals after the decoder and lacks supervision upon high-level semantics for the encoder, thus suffering from sub-optimal learned representations and long pre-training epochs. To alleviate this, previous methods simply replace the pixel reconstruction targets of 75% masked tokens by encoded features from pre-trained image-image (DINO) or image-language (CLIP) contrastive learning. Different from those efforts, we propose to <i>M</i>imic before <i>R</i>econstruct for Masked Autoencoders, named as <i>MR-MAE</i>, which jointly learns high-level and low-level representations without interference during pre-training. For high-level semantics, MR-MAE employs a mimic loss over 25% visible tokens from the encoder to capture the pre-trained patterns encoded in CLIP and DINO. For low-level structures, we inherit the reconstruction loss in MAE to predict RGB pixel values for 75% masked tokens after the decoder. As MR-MAE applies high-level and low-level targets respectively at different partitions, the learning conflicts between them can be naturally overcome and contribute to superior visual representations for various downstream tasks. On ImageNet-1K, the MR-MAE base pre-trained for only 400 epochs achieves 85.8% top-1 accuracy after fine-tuning, surpassing the 1600-epoch MAE base by <span>(+2.2)</span>% and the previous state-of-the-art BEiT V2 base by <span>(+0.3)</span>%. Pretrained checkpoints are released at https://github.com/Alpha-VL/ConvMAE.\u0000</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"83 23","pages":""},"PeriodicalIF":19.5,"publicationDate":"2023-11-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138437608","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Universal Representations: A Unified Look at Multiple Task and Domain Learning","authors":"Wei-Hong Li, Xialei Liu, Hakan Bilen","doi":"10.1007/s11263-023-01931-6","DOIUrl":"https://doi.org/10.1007/s11263-023-01931-6","url":null,"abstract":"<p>We propose a unified look at jointly learning multiple vision tasks and visual domains through <i>universal representations</i>, a single deep neural network. Learning multiple problems simultaneously involves minimizing a weighted sum of multiple loss functions with different magnitudes and characteristics and thus results in unbalanced state of one loss dominating the optimization and poor results compared to learning a separate model for each problem. To this end, we propose distilling knowledge of multiple task/domain-specific networks into a single deep neural network after aligning its representations with the task/domain-specific ones through small capacity adapters. We rigorously show that universal representations achieve state-of-the-art performances in learning of multiple dense prediction problems in NYU-v2 and Cityscapes, multiple image classification problems from diverse domains in Visual Decathlon Dataset and cross-domain few-shot learning in MetaDataset. Finally we also conduct multiple analysis through ablation and qualitative studies.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"83 20","pages":""},"PeriodicalIF":19.5,"publicationDate":"2023-11-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138437507","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Few-Shot Stereo Matching with High Domain Adaptability Based on Adaptive Recursive Network","authors":"Rongcheng Wu, Mingzhe Wang, Zhidong Li, Jianlong Zhou, Fang Chen, Xuan Wang, Changming Sun","doi":"10.1007/s11263-023-01953-0","DOIUrl":"https://doi.org/10.1007/s11263-023-01953-0","url":null,"abstract":"<p>Deep learning based stereo matching algorithms have been extensively researched in areas such as robot vision and autonomous driving due to their promising performance. However, these algorithms require a large amount of labeled data for training and encounter inadequate domain adaptability, which degraded their applicability and flexibility. This work addresses the two deficiencies and proposes a few-shot trained stereo matching model with high domain adaptability. In the model, stereo matching is formulated as the problem of dynamic optimization in the possible solution space, and a multi-scale matching cost computation method is proposed to obtain the possible solution space for the application scenes. Moreover, an adaptive recurrent 3D convolutional neural network is designed to determine the optimal solution from the possible solution space. Experimental results demonstrate that the proposed model outperforms the state-of-the-art stereo matching algorithms in terms of training requirements and domain adaptability.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"83 22","pages":""},"PeriodicalIF":19.5,"publicationDate":"2023-11-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138437609","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FastTrack: A Highly Efficient and Generic GPU-Based Multi-object Tracking Method with Parallel Kalman Filter","authors":"Chongwei Liu, Haojie Li, Zhihui Wang","doi":"10.1007/s11263-023-01933-4","DOIUrl":"https://doi.org/10.1007/s11263-023-01933-4","url":null,"abstract":"<p>The Kalman Filter based on uniform assumption has been a crucial motion estimation module in trackers. However, it has limitations in non-uniform motion modeling and computational efficiency when applied to large-scale object tracking scenarios. To address these issues, we propose a novel <b><i>Parallel Kalman Filter (PKF)</i></b>, which simplifies conventional state variables to reduces computational load and enable effective non-uniform modeling. Within PKF, we propose a non-uniform formulation which models non-uniform motion as uniform motion by transforming the time interval <span>(Delta t)</span> from a constant into a variable related to displacement, and incorporate a deceleration strategy into the control-input model of the formulation to tackle the escape problem in Multi-Object Tracking (MOT); an innovative parallel computation method is also proposed, which transposes the computation graph of PKF from the matrix to the quadratic form, significantly reducing the computational load and facilitating parallel computation between distinct tracklets via CUDA, thus making the time consumption of PKF independent of the input tracklet scale, i.e., <i>O</i>(1). Based on PKF, we introduce <b><i>Fast</i></b>, <i>the first fully GPU-based tracker paradigm</i>, which significantly enhances tracking efficiency in large-scale object tracking scenarios; and <b><i>FastTrack</i></b>, the MOT system composed of Fast and a general detector, offering high efficiency and generality. Within FastTrack, Fast only requires bounding boxes with scores and class ids for a single association during one iteration, and introduces innovative GPU-based tracking modules, such as an efficient GPU 2D-array data structure for tracklet management, a novel cost matrix implemented in CUDA for automatic association priority determination, a new association metric called HIoU, and the first implementation of the Auction Algorithm in CUDA for the asymmetric assignment problem. Experiments show that the average time per iteration of PKF on a GTX 1080Ti is only 0.2 ms; Fast can achieve a real-time efficiency of 250FPS on a GTX 1080Ti and 42FPS even on a Jetson AGX Xavier, outperforming conventional CPU-based trackers. Concurrently, FastTrack demonstrates state-of-the-art performance on four public benchmarks, specifically MOT17, MOT20, KITTI, and DanceTrack, and attains the highest speed in large-scale tracking scenarios of MOT20.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"29 22","pages":""},"PeriodicalIF":19.5,"publicationDate":"2023-11-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138293686","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards Frame Rate Agnostic Multi-object Tracking","authors":"Weitao Feng, Lei Bai, Yongqiang Yao, Fengwei Yu, Wanli Ouyang","doi":"10.1007/s11263-023-01943-2","DOIUrl":"https://doi.org/10.1007/s11263-023-01943-2","url":null,"abstract":"<p>Multi-object Tracking (MOT) is one of the most fundamental computer vision tasks that contributes to various video analysis applications. Despite the recent promising progress, current MOT research is still limited to a fixed sampling frame rate of the input stream. They are neither as flexible as humans nor well-matched to industrial scenarios which require the trackers to be frame rate insensitive in complicated conditions. In fact, we empirically found that the accuracy of all recent state-of-the-art trackers drops dramatically when the input frame rate changes. For a more intelligent tracking solution, we shift the attention of our research work to the problem of Frame Rate Agnostic MOT (FraMOT), which takes frame rate insensitivity into consideration. In this paper, we propose a Frame Rate Agnostic MOT framework with a Periodic training Scheme (FAPS) to tackle the FraMOT problem for the first time. Specifically, we propose a Frame Rate Agnostic Association Module (FAAM) that infers and encodes the frame rate information to aid identity matching across multi-frame-rate inputs, improving the capability of the learned model in handling complex motion-appearance relations in FraMOT. Moreover, the association gap between training and inference is enlarged in FraMOT because those post-processing steps not included in training make a larger difference in lower frame rate scenarios. To address it, we propose Periodic Training Scheme to reflect all post-processing steps in training via tracking pattern matching and fusion. Along with the proposed approaches, we make the first attempt to establish an evaluation method for this new task of FraMOT. Besides providing simulations and evaluation metrics, we try to solve new challenges in two different modes, i.e., known frame rate and unknown frame rate, aiming to handle a more complex situation. The quantitative experiments on the challenging MOT17/20 dataset (FraMOT version) have clearly demonstrated that the proposed approaches can handle different frame rates better and thus improve the robustness against complicated scenarios.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"29 21","pages":""},"PeriodicalIF":19.5,"publicationDate":"2023-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138293687","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PartCom: Part Composition Learning for 3D Open-Set Recognition","authors":"Tingyu Weng, Jun Xiao, Hao Pan, Haiyong Jiang","doi":"10.1007/s11263-023-01947-y","DOIUrl":"https://doi.org/10.1007/s11263-023-01947-y","url":null,"abstract":"<p>In this work, we address 3D open-set recognition (OSR) that can recognize known classes as well as be aware of unknown classes during testing. The key challenge of 3D OSR is that unknown objects are not available during training and 3D closed set recognition methods trained on known classes usually classify an unknown object as a known one with high confidence. This over-confidence is mainly due to the fact that local part information in 3D shapes provides the main evidence for known class recognition, which nevertheless leads to the incorrect recognition of unknown classes that have similar local parts but arranged very differently. To address this problem, we propose <i>PartCom</i>, a 3D OSR method that calls attention to not only part information but also the part composition that is unique to each class. <i>PartCom</i> uses a part codebook to learn the different parts across object classes, and represents part composition as a latent distribution over the codebook. In this way, both known classes and unknown classes are cast into the space of learned parts, but known classes have composites largely distinguished from unknown ones, which enables OSR. To learn the part codebook, we formulate two necessary constraints to ensure the part codebook encodes diverse parts of different classes compactly and efficiently. In addition, we propose an optional augmenting module of <i>Part-aware Unknown feaTure Synthesis</i>, that further reduces open-set misclassification risks by synthesizing novel part compositions to be regarded as unknown classes. This synthesis is simply achieved by mixing part codes of different classes; training with such augmented data makes classifiers’ decision boundaries more closely fit the known classes and therefore improves open-set recognition. To evaluate the proposed method, we construct four 3D OSR tasks based on datasets of CAD shapes, multi-view scanned shapes, and LiDAR scanned shapes. Extensive experiments show that our method achieves significantly superior results than SOTA baselines on all tasks.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"83 20","pages":""},"PeriodicalIF":19.5,"publicationDate":"2023-11-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138293128","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Adapting Across Domains via Target-Oriented Transferable Semantic Augmentation Under Prototype Constraint","authors":"Mixue Xie, Shuang Li, Kaixiong Gong, Yulin Wang, Gao Huang","doi":"10.1007/s11263-023-01944-1","DOIUrl":"https://doi.org/10.1007/s11263-023-01944-1","url":null,"abstract":"<p>The demand for reducing label annotation cost and adapting to new data distributions gives rise to the emergence of domain adaptation (DA). DA aims to learn a model that performs well on the unlabeled or scarcely labeled target domain by transferring the rich knowledge from a related and well-annotated source domain. Existing DA methods mainly resort to learning domain-invariant representations with a source-supervised classifier shared by two domains. However, such a shared classifier may bias towards source domain, limiting its generalization capability on target data. To alleviate this issue, we present a <i>target-oriented transferable semantic augmentation (T</i><span>(^2)</span><i>SA)</i> method, which enhances the generalization ability of the classifier by training it with a target-like augmented domain, constructed by semantically augmenting source data towards target at the feature level in an implicit manner. Specifically, to equip the augmented domain with target semantics, we delicately design a class-wise multivariate normal distribution based on the statistics estimated from features to sample the transformation directions for source data. Moreover, we achieve the augmentation implicitly by minimizing the upper bound of the expected Angular-softmax loss over the augmented domain, which is of high efficiency. Additionally, to further ensure that the augmented domain can imitate target domain nicely and discriminatively, the prototype constraint is enforced on augmented features class-wisely, which minimizes the expected distance between augmented features and corresponding target prototype (i.e., average representation) in Euclidean space. As a general technique, T<span>(^2)</span>SA can be easily plugged into various DA methods to further boost their performances. Extensive experiments under single-source DA, multi-source DA and domain generalization scenarios validate the efficacy of T<span>(^2)</span>SA.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"83 19","pages":""},"PeriodicalIF":19.5,"publicationDate":"2023-11-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138293129","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Image and Object Geo-Localization","authors":"Daniel Wilson, Xiaohan Zhang, Waqas Sultani, Safwan Wshah","doi":"10.1007/s11263-023-01942-3","DOIUrl":"https://doi.org/10.1007/s11263-023-01942-3","url":null,"abstract":"<p>The concept of geo-localization broadly refers to the process of determining an entity’s geographical location, typically in the form of Global Positioning System (GPS) coordinates. The entity of interest may be an image, a sequence of images, a video, a satellite image, or even objects visible within the image. Recently, massive datasets of GPS-tagged media have become available due to smartphones and the internet, and deep learning has risen to prominence and enhanced the performance capabilities of machine learning models. These developments have enabled the rise of image and object geo-localization, which has impacted a wide range of applications such as augmented reality, robotics, self-driving vehicles, road maintenance, and 3D reconstruction. This paper provides a comprehensive survey of visual geo-localization, which may involve either determining the location at which an image has been captured (image geo-localization) or geolocating objects within an image (object geo-localization). We will provide an in-depth study of visual geo-localization including a summary of popular algorithms, a description of proposed datasets, and an analysis of performance results to illustrate the current state of the field.\u0000</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"83 21","pages":""},"PeriodicalIF":19.5,"publicationDate":"2023-11-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138293127","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}