Title: Fusion Tree Network for RGBT Tracking
Authors: Zhiyuan Cheng, Andong Lu, Zhang Zhang, Chenglong Li, Liang Wang
DOI: https://doi.org/10.1109/AVSS56176.2022.9959406
Published in: 2022 18th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS)
Abstract: RGBT tracking is often affected by complex scenes (e.g., occlusions, scale changes, and noisy backgrounds). Existing works usually adopt a single-strategy fusion scheme to handle modality fusion in all scenarios. However, due to the limited capacity of such fusion models, it is difficult to fully integrate the discriminative features of the different modalities. To tackle this problem, we propose a Fusion Tree Network (FTNet), which provides a multi-strategy fusion model with high capacity to efficiently fuse the different modalities. Specifically, we combine three kinds of attention modules (channel attention, spatial attention, and location attention) in a tree structure to achieve multi-path hybrid attention in the deeper convolutional stages of the object tracking network. Extensive experiments on three RGBT tracking datasets show that our method achieves superior performance compared with state-of-the-art RGBT tracking models.

Title: UPM-GTI-Face: A dataset for the evaluation of the impact of distance and masks in face detection and recognition systems
Authors: Marcos Rodrigo, E. González-Sosa, Carlos Cuevas, N. García
DOI: https://doi.org/10.1109/AVSS56176.2022.9959558
Published in: 2022 18th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS)
Abstract: We present a novel dataset for the evaluation of face detection and recognition algorithms in challenging surveillance scenarios. The dataset consists of 4K images of different subjects captured at annotated distances ranging from 1 to 30 meters, in both indoor and outdoor environments, and under two face mask conditions (with and without). To the best of our knowledge, this is the only existing dataset that addresses the joint impact of masks and distance in a rigorous manner. We also propose an end-to-end, fully automatic face detection and recognition system to provide baseline results on this dataset. Face detection is performed with the Tiny Faces network, while face recognition is performed with the VGG Face network. Experimental results show very high detection and recognition rates up to a distance of 20 meters, beyond which the impact of distance becomes clear (especially for recognition). The use of face masks degrades the detection range and produces less consistent recognition results.

Title: Learning Occlusion-Aware Dense Correspondences for Multi-Modal Images
Authors: Ryosuke Shimoya, Takashi Morimoto, J. van Baar, P. Boufounos, Yanting Ma, Hassan Mansour
DOI: https://doi.org/10.1109/AVSS56176.2022.9959354
Published in: 2022 18th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS)
Abstract: We introduce a scalable multi-modal approach to learn dense, i.e., pixel-level, correspondences and occlusion maps between images in a video sequence. The problems of finding dense correspondences and occlusion maps are fundamental in computer vision. In this work we jointly train a deep network to tackle both, with a shared feature extraction stage. We use depth and color images with ground truth optical flow and occlusion maps to train the network end-to-end. From the multi-modal input, the network learns to estimate occlusion maps, optical flows, and a correspondence embedding providing a meaningful latent feature space. We evaluate the performance on a dataset of images derived from synthetic characters, and perform a thorough ablation study to demonstrate that the proposed components of our architecture combine to achieve the lowest correspondence error. The scalability of our proposed method comes from the ability to incorporate additional modalities, e.g., infrared images.

Title: Background Subtraction Network Module Ensemble for Background Scene Adaptation
Authors: Taiki Hamada, T. Minematsu, Atsushi Shimada, Fumiya Okubo, Yuta Taniguchi
DOI: https://doi.org/10.1109/AVSS56176.2022.9959316
Published in: 2022 18th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS)
Abstract: Background subtraction networks outperform traditional hand-crafted background subtraction methods. The main advantage of background subtraction networks is their ability to automatically learn background features for training scenes. When applying a trained network to new target scenes, adapting the network to the new scenes is crucial. However, few studies have focused on reusing multiple trained models for new target scenes. Considering that background changes fall into several categories, such as illumination changes, a model trained for a particular background scene can work effectively for a target scene similar to the training scene. In this study, we propose a method to ensemble the module networks trained for each background scene. Experimental results show that the proposed method is significantly more accurate than conventional methods in the target scene when tuned with only a few frames.

{"title":"Dual Camera Based High Spatio-Temporal Resolution Video Generation For Wide Area Surveillance","authors":"H. U. Suluhan, H. Ateş, B. Gunturk","doi":"10.1109/AVSS56176.2022.9959711","DOIUrl":"https://doi.org/10.1109/AVSS56176.2022.9959711","url":null,"abstract":"Wide area surveillance (WAS) requires high spatiotemporal resolution (HSTR) video for better precision. As an alternative to expensive WAS systems, low-cost hybrid imaging systems can be used. This paper presents the usage of multiple video feeds for the generation of HSTR video as an extension of reference based super resolution (RefSR). One feed captures video at high spatial resolution with low frame rate (HSLF) while the other captures low spatial resolution and high frame rate (LSHF) video simultaneously for the same scene. The main purpose is to create an HSTR video from the fusion of HSLF and LSHF videos. In this paper we propose an end-to-end trainable deep network that performs optical flow (OF) estimation and frame reconstruction by combining inputs from both video feeds. The proposed architecture provides significant improvement over existing video frame interpolation and RefSR techniques in terms of PSNR and SSIM metrics and can be deployed on drones with dual cameras.","PeriodicalId":408581,"journal":{"name":"2022 18th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS)","volume":"26 6","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132870904","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Accelerated Blind Deblurring Method via Video-based Estimation in Next Point Spread Functions for Surveillance","authors":"A. Güven, Ceren Özçelik, D. M. Sazak","doi":"10.1109/AVSS56176.2022.9959473","DOIUrl":"https://doi.org/10.1109/AVSS56176.2022.9959473","url":null,"abstract":"Blind deblurring has been attracting increased attention. In real-life problems, high-resolution images are needed to process and the blurring function, point spread function (PSF), is mostly unknown, especially in the surveillance systems such as camera integrated payload drop with a parachute. The PSFs are dependent on their previous functions, so we perform the deblurring process faster with our proposed model by integrating a previously prepared deep learning method. Our system consists of four phases: (i) enhancing images with an existing deep learning method, (ii) obtaining PSFs, (iii) predicting the next PSFs with our model, and (iv) enhancing the images with the wienerfiltering we developed. The number of PSFs to be estimated was experimentally found as the point at which the PSNR value began to decrease in the test images. Convolutional LSTM layers were used for our model which has been compared with other state-of-the-art models in terms of performance and running time.","PeriodicalId":408581,"journal":{"name":"2022 18th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS)","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117136654","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: FPDM: Fisheye Panoptic segmentation dataset for Door Monitoring
Authors: Mohamed Thioune, Sanaa Chafik, Ankur Mahtani, Olivier Laurendin, Safia Boudra
DOI: https://doi.org/10.1109/AVSS56176.2022.9959151
Published in: 2022 18th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS)
Abstract: Most existing panoptic segmentation datasets are not suited for applications in the railway environment. This paper introduces a new dataset composed of video feeds taken in the vicinity of train doors. It is aimed at training deep learning algorithms to identify obstacles between the doors, in order to ensure passenger safety during boarding and to reduce boarding time. The dataset is acquired from fisheye cameras located at the train doors. The data is annotated entirely manually. The Fisheye Panoptic Door Monitoring dataset (FPDM) contains 3952 images with their annotation masks, featuring 18 of the most frequent instance categories in the vicinity of train doors. FPDM answers the panoptic segmentation challenge by offering a new, challenging dataset to the computer vision community. We present detailed information on the acquisition, annotation, and division of the data into training and validation sets, together with an evaluation of an existing deep learning method.

Title: Automated Single Particle Growth Measurement using Segmentation
Authors: M. Rafique, Muhammad Ishfaq Hussain, M. Hassan, W. Jung, Bong-Joong Kim, M. Jeon
DOI: https://doi.org/10.1109/AVSS56176.2022.9959296
Published in: 2022 18th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS)
Abstract: Fine-grained imaging is revealing the secrets of nature with every passing day, and artificial intelligence is reducing the manual effort required for detailed analysis. This work proposes an automated, real-time growth measurement of a particle in electron microscope images. The particle selected in this study is an Au spiky nanoparticle (SNP) that develops spikes over the course of its growth. Multiple conventional and sophisticated algorithms are used to segment the particle with supervised and unsupervised learning techniques. A comprehensive analysis of the automated techniques is presented with qualitative and quantitative results.

Title: Robust Unseen Video Understanding for Various Surveillance Environments
Authors: Prashant W. Patil, Jasdeep Singh, Praful Hambarde, Ashutosh Kulkarni, S. Chaudhary, S. Murala
DOI: https://doi.org/10.1109/AVSS56176.2022.9959513
Published in: 2022 18th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS)
Abstract: Automated video-based applications are in high demand from a security perspective, where the detection of moving objects, i.e., moving object segmentation (MOS), is performed. Therefore, we propose an effective solution: a spatio-temporal squeeze-excitation mechanism (SqEm) based multi-level feature-sharing encoder-decoder network for MOS. Here, the SqEm module is proposed to obtain prominent foreground edge information using spatio-temporal features. Further, a multi-level feature-sharing residual decoder module is proposed, which uses the respective SqEm features and previous output features for accurate and consistent foreground segmentation. To handle the foreground/background class imbalance issue, we propose a region-of-interest-based edge loss. Extensive experimental analysis is conducted on three databases. Result analysis and an ablation study prove the robustness of the proposed network for unseen video understanding over SOTA methods.

{"title":"Deformable Modules for Flexible Feature Sampling on Vision Transformer","authors":"Chanjong Park, Dongha Bahn, Jae-il Jung","doi":"10.1109/AVSS56176.2022.9959253","DOIUrl":"https://doi.org/10.1109/AVSS56176.2022.9959253","url":null,"abstract":"Vision transformers have shown that the self-attention mechanism performs well in the computer vision field. However, since such transformers are based on data sampled from fixed areas, there is a limit to efficiently learning the important features in images. To compensate, we propose two modules based on the deformable operation: deformable patch embedding and deformable pooling. Deformable patch embedding consists of a hybrid structure of standard and deformable convolutions, and adaptively samples features from an image. The deformable pooling module also has a similar structure to the embedding module, but it not only samples data flexibly after self-attention but also allows the transformer to learn spatial information of various scales. The experimental results show that the transformer with the proposed modules converges faster and outperforms various vision transformers on image classification (ImageNet-1K) and object detection (MS-COCO).","PeriodicalId":408581,"journal":{"name":"2022 18th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS)","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116605704","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}