{"title":"Occlusion-Aware Self-Supervised Stereo Matching with Confidence Guided Raw Disparity Fusion","authors":"Xiule Fan, Soo Jeon, B. Fidan","doi":"10.1109/CRV55824.2022.00025","DOIUrl":"https://doi.org/10.1109/CRV55824.2022.00025","url":null,"abstract":"Commercially available stereo cameras used in robots and other intelligent systems to obtain depth information typically rely on traditional stereo matching algorithms. Although their raw (predicted) disparity maps contain incorrect estimates, these algorithms can still provide useful prior information towards more accurate prediction. We propose a pipeline to incorporate this prior information to produce more accurate disparity maps. The proposed pipeline includes a confidence generation component to identify raw disparity inaccuracies as well as a self-supervised deep neural network (DNN) to predict disparity and compute the corresponding occlusion masks. The proposed DNN consists of a feature extraction module, a confidence guided raw disparity fusion module to generate an initial disparity map, and a hierarchical occlusion-aware disparity refinement module to compute the final estimates. Experimental results on public datasets verify that the proposed pipeline has competitive accuracy with real-time processing rate. We also test the pipeline with images captured by commercial stereo cameras to show its effectiveness in improving their raw disparity estimates.","PeriodicalId":131142,"journal":{"name":"2022 19th Conference on Robots and Vision (CRV)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116524836","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Classification of handwritten annotations in mixed-media documents","authors":"Amanda Dash, A. Albu","doi":"10.1109/CRV55824.2022.00027","DOIUrl":"https://doi.org/10.1109/CRV55824.2022.00027","url":null,"abstract":"Handwritten annotations in documents contain valuable information, but they are challenging to detect and identify. This paper addresses this challenge. We propose an al-gorithm for generating a novel mixed-media document dataset, Annotated Docset, that consists of 14 classes of machine-printed and handwritten elements and annotations. We also propose a novel loss function, Dense Loss, which can correctly identify small objects in complex documents when used in fully convolutional networks (e.g. U-NET, DeepLabV3+). Our Dense Loss function is a compound function that uses local region homogeneity to promote contiguous and smooth segmentation predictions while also using an L1-norm loss to reconstruct the dense-labelled ground truth. By using regression instead of a probabilistic approach to pixel classification, we avoid the pitfalls of training on datasets with small or underrepre-sented objects. We show that our loss function outperforms other semantic segmentation loss functions for imbalanced datasets, containing few elements that occupy small areas. Experimental results show that the proposed method achieved a mean Intersection-over-Union (mIoU) score of 0.7163 for all document classes and 0.6290 for handwritten annotations, thus outperforming state-of-the-art loss functions.","PeriodicalId":131142,"journal":{"name":"2022 19th Conference on Robots and Vision (CRV)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128668276","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Occluded Text Detection and Recognition in the Wild","authors":"Z. Raisi, J. Zelek","doi":"10.1109/CRV55824.2022.00026","DOIUrl":"https://doi.org/10.1109/CRV55824.2022.00026","url":null,"abstract":"The performance of existing deep-learning scene text recognition-based methods fails significantly on occluded text instances or even partially occluded characters in a text due to their reliance on the visibility of the target characters in images. This failure is often due to features generated by the current architectures with limited robustness to occlusion, which opens the possibility of improving the feature extractors and/or the learning models to better handle these severe occlusions. In this paper, we first evaluate the performance of the current scene text detection, scene text recognition, and scene text spotting models using two publicly-available occlusion datasets: Occlusion Scene Text (OST) that is designed explicitly for scene text recognition, and we also prepare an Occluded Character-level using the Total-Text (OCTT) dataset for evaluating the scene text spotting and detection models. Then we utilize a very recent Transformer-based framework in deep learning, namely Masked Auto Encoder (MAE), as a backbone for scene text detection and recognition pipelines to mitigate the occlusion problem. The performance of our scene text recognition and end-to-end scene text spotting models improves by transfer learning on the pre-trained MAE backbone. For example, our recognition model witnessed a 4% word recognition accuracy on the OST dataset. Our end-to-end text spotting model achieved 68.5% F-measure performance outperforming the stat-of-the-art methods when equipped with an MAE backbone compared to a convolutional neural network (CNN) backbone on the OCTT dataset.","PeriodicalId":131142,"journal":{"name":"2022 19th Conference on Robots and Vision (CRV)","volume":"107 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115002103","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Proceedings 2022 19th Conference on Robots and Vision","authors":"","doi":"10.1109/crv55824.2022.00001","DOIUrl":"https://doi.org/10.1109/crv55824.2022.00001","url":null,"abstract":"","PeriodicalId":131142,"journal":{"name":"2022 19th Conference on Robots and Vision (CRV)","volume":"257 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133557166","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Instance Segmentation of Herring and Salmon Schools in Acoustic Echograms using a Hybrid U-Net","authors":"Alex L. Slonimer, Melissa Cote, T. Marques, A. Rezvanifar, S. Dosso, A. Albu, Kaan Ersahin, T. Mudge, S. Gauthier","doi":"10.1109/CRV55824.2022.00010","DOIUrl":"https://doi.org/10.1109/CRV55824.2022.00010","url":null,"abstract":"The automated classification of fish, such as herring and salmon, in multi-frequency echograms is important for ecosystems monitoring. This paper implements a novel approach to instance segmentation: a hybrid of deep-learning and heuristic methods. This approach implements semantic segmentation by a U-Net to detect fish, which are converted to instances of fish-schools derived from candidate components within a defined linking distance. In addition to four frequency channels of echogram data (67.5, 125, 200, 455 kHz), two simulated channels (water depth and solar elevation angle) are included to encode spatial and temporal information, which leads to substantial improvement in model performance. The model is shown to out-perform recent experiments that have used a Mask R-CNN architecture. This approach demonstrates the ability to classify sparsely distributed objects in a way that is not possible with state-of-the-art instance segmentation methods.","PeriodicalId":131142,"journal":{"name":"2022 19th Conference on Robots and Vision (CRV)","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124585471","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Lasso Method for Multi-Robot Foraging","authors":"A. Vardy","doi":"10.1109/CRV55824.2022.00022","DOIUrl":"https://doi.org/10.1109/CRV55824.2022.00022","url":null,"abstract":"We propose a novel approach to multi-robot foraging. This approach makes use of a scalar field to guide robots throughout an environment while gathering objects towards the goal. The environment must be planar with a closed, contiguous boundary. However, the boundary's shape can be arbitrary. Conventional robot foraging methods assume an open environment or a simple boundary that never impedes the robots—a limitation which our method overcomes. Our distributed control algorithm causes the robots to circumnavigate the environment and nudge objects inwards towards the goal. We demonstrate the performance of our approach using real-world and simulated experiments and study the impact of the number of robots, the complexity of the boundary, and limitations on the sensing range.","PeriodicalId":131142,"journal":{"name":"2022 19th Conference on Robots and Vision (CRV)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129602458","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Exact Fast Fourier Method for Morphological Dilation and Erosion Using the Umbra Technique","authors":"V. Sridhar, M. Breuß","doi":"10.1109/CRV55824.2022.00032","DOIUrl":"https://doi.org/10.1109/CRV55824.2022.00032","url":null,"abstract":"In this paper we consider the fundamental operations dilation and erosion of mathematical morphology. It is well known that many powerful image filtering operations can be constructed by their combinations. We propose a fast and novel algorithm based on the Fast Fourier Transform to compute grey-value morphological operations on an image. The novel method may deal with non-flat filters and incorporates no restrictions on shape and size of the filtering window, in contrast to many other fast methods in the field. Unlike fast Fourier techniques from previous works, the novel method gives exact results and is not an approximation. The key aspect which allows to achieve this is to explore here for the first time in this context the umbra formulation of images and filters. We show that the new method is in practice particularly suitable for filtering images with small tonal range or when employing large filter sizes.","PeriodicalId":131142,"journal":{"name":"2022 19th Conference on Robots and Vision (CRV)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127481240","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Semi-supervised Grounding Alignment for Multi-modal Feature Learning","authors":"Shih-Han Chou, Zicong Fan, J. Little, L. Sigal","doi":"10.1109/CRV55824.2022.00015","DOIUrl":"https://doi.org/10.1109/CRV55824.2022.00015","url":null,"abstract":"Self-supervised transformer-based architectures, such as ViLBERT [1] and others, have recently emerged as dominant paradigms for multi-modal feature learning. Such architectures leverage large-scale datasets (e.g., Conceptual Captions [2]) and, typically, image-sentence pairings, for self-supervision. However, conventional multi-modal feature learning requires huge datasets and computing for both pre-training and fine-tuning to the target task. In this paper, we illustrate that more granular semi-supervised alignment at a region-phrase level is an additional useful cue and can further improve the performance of such representations. To this end, we propose a novel semi-supervised grounding alignment loss, which leverages an off-the-shelf pre-trained phrase grounding model for pseudo-supervision (by producing region-phrase alignments). This semi-supervised formulation enables better feature learning in the absence of any additional human annotations on the large-scale (Conceptual Captions) dataset. Further, it shows an even larger margin of improvement on smaller data splits, leading to effective data-efficient feature learning. We illustrate the superiority of the learned features by fine-tuning the resulting models to multiple vision-language downstream tasks: visual question answering (VQA), visual commonsense reasoning (VCR), and visual grounding. Experiments on the VQA, VCR, and grounding benchmarks demonstrate the improvement of up to 1.3% in accuracy (in visual grounding) with large-scale training; up to 5.9% (in VQA) with 1/8 of the data for pre-training and fine-tuning11We will release the code and all pre-trained models upon acceptance..","PeriodicalId":131142,"journal":{"name":"2022 19th Conference on Robots and Vision (CRV)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116919633","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Safe Landing Zones Detection for UAVs Using Deep Regression","authors":"Sakineh Abdollahzadeh, Pier-Luc Proulx, M. S. Allili, J. Lapointe","doi":"10.1109/CRV55824.2022.00035","DOIUrl":"https://doi.org/10.1109/CRV55824.2022.00035","url":null,"abstract":"Finding safe landing zones (SLZ) in urban areas and natural scenes is one of the many challenges that must be overcome in automating Unmanned Aerial Vehicles (UAV) navigation. Using passive vision sensors to achieve this objective is a very promising avenue due to their low cost and the potential they provide for performing simultaneous terrain analysis and 3D reconstruction. In this paper, we propose using a deep learning approach on UAV imagery to assess the SLZ. The model is built on a semantic segmentation architecture whereby thematic classes of the terrain are mapped into safety scores for UAV landing. Contrary to past methods, which use hard classification into safe/unsafe landing zones, our approach provides a continuous safety map that is more practical for an emergency landing. Experiments on public datasets have shown promising results.","PeriodicalId":131142,"journal":{"name":"2022 19th Conference on Robots and Vision (CRV)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124039864","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}