{"title":"Exploring Joint Embedding Architectures and Data Augmentations for Self-Supervised Representation Learning in Event-Based Vision","authors":"Sami Barchid, José Mennesson, C. Djeraba","doi":"10.1109/CVPRW59228.2023.00405","DOIUrl":"https://doi.org/10.1109/CVPRW59228.2023.00405","url":null,"abstract":"This paper proposes a self-supervised representation learning (SSRL) framework for event-based vision, which leverages various lightweight convolutional neural networks (CNNs) including 2D-, 3D-, and Spiking CNNs. The method uses a joint embedding architecture to maximize the agreement between features extracted from different views of the same event sequence. Popular event data augmentation techniques are employed to design an efficient augmentation policy for event-based SSRL, and we provide novel data augmentation methods to enhance the pretraining pipeline. Given the novelty of SSRL for event-based vision, we elaborate standard evaluation protocols and use them to evaluate our approach. Our study demonstrates that pretrained CNNs acquire effective and transferable features, enabling them to achieve competitive performance in object or action recognition across various commonly used event-based datasets, even in a low-data regime. This paper also conducts an experimental analysis of the extracted features regarding the Uniformity-Tolerance tradeoff to assess their quality, and measure the similarity of representations using linear Center Kernel Alignement. These quantitative measurements reinforce our observations from the performance benchmarks and show substantial differences between the learned representations of all types of CNNs despite being optimized with the same approach.","PeriodicalId":355438,"journal":{"name":"2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"113966183","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"RB-Dust - A Reference-based Dataset for Vision-based Dust Removal","authors":"P. Buckel, T. Oksanen, Thomas Dietmueller","doi":"10.1109/CVPRW59228.2023.00121","DOIUrl":"https://doi.org/10.1109/CVPRW59228.2023.00121","url":null,"abstract":"Dust in the agricultural landscape is a significant challenge and influences, for example, the environmental perception of autonomous agricultural machines. Image enhancement algorithms can be used to reduce dust. However, these require dusty and dust-free images of the same environment for validation. In fact, to date, there is no dataset that we are aware of that addresses this issue. Therefore, we present the agriscapes RB-Dust dataset, which is named after its purpose of reference-based dust removal. It is not possible to take pictures from the cabin during tillage, as this would cause shifts in the images. Because of this, we built a setup from which it is possible to take images from a stationary position close to the passing tractor. The test setup was based on a half-sided gate through which the tractor could drive. The field tests were carried out on a farm in Bavaria, Germany, during tillage. During the field tests, other parameters such as soil moisture and wind speed were controlled, as these significantly affect dust development. We validated our dataset with contrast enhancement and image dehazing algorithms and analyzed the generalizability from recordings from the moving tractor. Finally, we demonstrate the application of dust removal based on a high-level vision task, such as person classification. Our empirical study confirms the validity of RB-Dust for vision-based dust removal in agriculture.","PeriodicalId":355438,"journal":{"name":"2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126442226","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Data-Driven Approach based on Dynamic Mode Decomposition for Efficient Encoding of Dynamic Light Fields","authors":"Joshitha Ravishankar, Sally Khaidem, Mansi Sharma","doi":"10.1109/CVPRW59228.2023.00347","DOIUrl":"https://doi.org/10.1109/CVPRW59228.2023.00347","url":null,"abstract":"Dynamic light fields provide a richer, more realistic 3D representation of a moving scene. However, this leads to higher data rates since excess storage and transmission requirements are needed. We propose a novel approach to efficiently represent and encode dynamic light field data for display applications based on dynamic mode decomposition (DMD). Acquired images are firstly obtained through optimized coded aperture patterns for each temporal frame/camera viewpoint of a dynamic light field. The underlying spatial, angular, and temporal correlations are effectively exploited by a data-driven DMD on these acquired images arranged as time snapshots. Next, High Efficiency Video Coding (HEVC) removes redundancies in light field data, including intra-frame and inter-frame redundancies, while maintaining high reconstruction quality. The proposed scheme is the first of its kind to treat light field videos as mathematical dynamical systems, leverage on dynamic modes of acquired images, and gain flexible coding at various bitrates. Experimental results demonstrate our scheme’s superior compression efficiency and bitrate savings compared to the direct encoding of acquired images using HEVC codec.","PeriodicalId":355438,"journal":{"name":"2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122249921","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"LiDAR-Based Localization on Highways Using Raw Data and Pole-Like Object Features","authors":"Sheng-Cheng Lee, Victor Lu, Chieh-Chih Wang, Wen-Chieh Lin","doi":"10.1109/CVPRW59228.2023.00028","DOIUrl":"https://doi.org/10.1109/CVPRW59228.2023.00028","url":null,"abstract":"Poles on highways provide important cues for how a scan should be localized onto a map. However existing point cloud scan matching algorithms do not fully leverage such cues, leading to suboptimal matching accuracy in highway environments. To improve the ability to match in such scenarios, we include pole-like objects for lateral information and add this information to the current matching algorithm. First, we classify the points from the LiDAR sensor using the Random Forests classifier to find the points that represent poles. Each detected pole point will then generate a residual by the distance to the nearest pole in map. The pole residuals are later optimized along with the point-to-distribution residuals proposed in the normal distributions transform (NDT) using a nonlinear least squares optimization to get the localization result. Compared to the baseline (NDT), our proposed method obtains a 34% improvement in accuracy on highway scenes in the localization problem. In addition, our experiment shows that the convergence area is significantly enlarged, increasing the usability of the self-driving car localization algorithm on highway scenarios.","PeriodicalId":355438,"journal":{"name":"2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)","volume":"181 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128178324","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MoveEnet: Online High-Frequency Human Pose Estimation with an Event Camera","authors":"Gaurvi Goyal, Franco Di Pietro, N. Carissimi, Arren J. Glover, C. Bartolozzi","doi":"10.1109/CVPRW59228.2023.00420","DOIUrl":"https://doi.org/10.1109/CVPRW59228.2023.00420","url":null,"abstract":"Human Pose Estimation (HPE) is crucial as a building block for tasks that are based on the accurate understanding of human position, pose and movements. Therefore, accuracy and efficiency in this block echo throughout a system, making it important to find efficient methods, that run at fast rates for online applications. The state of the art for mainstream sensors has made considerable advances, but event camera based HPE is still in its infancy. Event cameras boast high rates of data capture in a compact data structure, with advantages like high dynamic range and low power consumption. In this work, we present a system for a high frequency estimation of 2D, single-person Human Pose with event cameras. We provide an online system, that can be paired directly with an event camera to obtain high accuracy in real time. For quantitative results, we present our results on two large scale datasets, DHP19 and event-Human 3.6m. The system is robust to variance in the resolution of the camera and can run at up to 100Hz and an accuracy 89%.","PeriodicalId":355438,"journal":{"name":"2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121774850","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Underwater Moving Object Detection using an End-to-End Encoder-Decoder Architecture and GraphSage with Aggregator and Refactoring","authors":"Meghna Kapoor, Suvam Patra, B. Subudhi, V. Jakhetiya, Ankur Bansal","doi":"10.1109/CVPRW59228.2023.00597","DOIUrl":"https://doi.org/10.1109/CVPRW59228.2023.00597","url":null,"abstract":"Underwater environments are greatly affected by several factors, including low visibility, high turbidity, backscattering, dynamic background, etc., and hence pose challenges in object detection. Several algorithms consider convolutional neural networks to extract deep features and then object detection using the same. However, the dependency on the kernel’s size and the network’s depth results in fading relationships of latent space features and also are unable to characterize the spatial-contextual bonding of the pixels. Hence, they are unable to procure satisfactory results in complex underwater scenarios. To re-establish this relationship, we propose a unique architecture for underwater object detection where U-Net architecture is considered with the ResNet-50 backbone. Further, the latent space features from the encoder are fed to the decoder through a GraphSage model. GraphSage-based model is explored to reweight the node relationship in non-euclidean space using different aggregator functions and hence characterize the spatio-contextual bonding among the pixels. Further, we explored the dependency on different aggregator functions: mean, max, and LSTM, to evaluate the model’s performance. We evaluated the proposed model on two underwater benchmark databases: F4Knowledge and underwater change detection. The performance of the proposed model is evaluated against eleven state-of-the-art techniques in terms of both visual and quantitative evaluation measures.","PeriodicalId":355438,"journal":{"name":"2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)","volume":"223 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115923285","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Synthetic Data for Defect Segmentation on Complex Metal Surfaces","authors":"Juraj Fulir, Lovro Bosnar, H. Hagen, Petra Gospodnetić","doi":"10.1109/CVPRW59228.2023.00465","DOIUrl":"https://doi.org/10.1109/CVPRW59228.2023.00465","url":null,"abstract":"Metal defect segmentation poses a great challenge for automated inspection systems due to the complex light reflection from the surface and lack of training data. In this work we introduce a real and synthetic defect segmentation dataset pair for multi-view inspection of a metal clutch part to overcome data shortage. Model pre-training on our synthetic dataset was compared to similar inspection datasets in the literature. Two techniques are presented to increase model training efficiency and prediction coverage in darker areas of the image. Results were collected over three popular segmentation architectures to confirm superior effectiveness of synthetic data and unveil various challenges of multi-view inspection.","PeriodicalId":355438,"journal":{"name":"2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)","volume":"139 ","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131435698","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Self-supervised Interest Point Detection and Description for Fisheye and Perspective Images","authors":"Marcela Mera-Trujillo, Shivang Patel, Yu Gu, Gianfranco Doretto","doi":"10.1109/CVPRW59228.2023.00691","DOIUrl":"https://doi.org/10.1109/CVPRW59228.2023.00691","url":null,"abstract":"Keypoint detection and matching is a fundamental task in many computer vision problems, from shape reconstruction, to structure from motion, to AR/VR applications and robotics. It is a well-studied problem with remarkable successes such as SIFT, and more recent deep learning approaches. While great robustness is exhibited by these techniques with respect to noise, illumination variation, and rigid motion transformations, less attention has been placed on image distortion sensitivity. In this work, we focus on the case when this is caused by the geometry of the cameras used for image acquisition, and consider the keypoint detection and matching problem between the hybrid scenario of a fisheye and a projective image. We build on a state-of-the-art approach and derive a self-supervised procedure that enables training an interest point detector and descriptor network. We also collected two new datasets for additional training and testing in this unexplored scenario, and we demonstrate that current approaches are suboptimal because they are designed to work in traditional projective conditions, while the proposed approach turns out to be the most effective.","PeriodicalId":355438,"journal":{"name":"2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132346141","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving Automatic Target Recognition in Low Data Regime using Semi-Supervised Learning and Generative Data Augmentation","authors":"Fadoua Khmaissia, H. Frigui","doi":"10.1109/CVPRW59228.2023.00521","DOIUrl":"https://doi.org/10.1109/CVPRW59228.2023.00521","url":null,"abstract":"We propose a new strategy to improve Automatic Target Recognition (ATR) from infrared (IR) images by leveraging semi-supervised learning and generative data augmentation.Our approach is twofold: first, we use an automatic detector’s outputs to augment the existing labeled and unlabeled data. Second, we introduce a confidence-guided data generative augmentation technique that focuses on learning from the most challenging regions of the feature space, to generate synthetic data which can be used as extra unlabeled data.We evaluate the proposed approach on a public dataset with IR imagery of civilian and military vehicles. We show that yields substantial percentage improvements in ATR performance relative to both the baseline fully supervised model trained using the existing data only, and a semi-supervised model trained without generative data augmentation. For instance, for the most challenging data partition, our method achieves a relative increase of 29.51% over the baseline fully supervised model and a relative improvement of 2.59% over the semi-supervised model. These results demonstrate the effectiveness of our approach in low-data regimes, where labeled data is limited or expensive to obtain.","PeriodicalId":355438,"journal":{"name":"2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130165979","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"M3ED: Multi-Robot, Multi-Sensor, Multi-Environment Event Dataset","authors":"Kenneth Chaney, Fernando Cladera Ojeda, Ziyun Wang, Anthony Bisulco, M. A. Hsieh, C. Korpela, Vijay R. Kumar, C. J. Taylor, Kostas Daniilidis","doi":"10.1109/CVPRW59228.2023.00419","DOIUrl":"https://doi.org/10.1109/CVPRW59228.2023.00419","url":null,"abstract":"We present M3ED, the first multi-sensor event camera dataset focused on high-speed dynamic motions in robotics applications. M3ED provides high-quality synchronized and labeled data from multiple platforms, including ground vehicles, legged robots, and aerial robots, operating in challenging conditions such as driving along off-road trails, navigating through dense forests, and performing aggressive flight maneuvers. Our dataset also covers demanding operational scenarios for event cameras, such as scenes with high egomotion and multiple independently moving objects. The sensor suite used to collect M3ED includes high-resolution stereo event cameras (1280×720), grayscale imagers, an RGB imager, a high-quality IMU, a 64-beam LiDAR, and RTK localization. This dataset aims to accelerate the development of event-based algorithms and methods for edge cases encountered by autonomous systems in dynamic environments.","PeriodicalId":355438,"journal":{"name":"2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134036883","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}