{"title":"Multiple Fisheye Camera Calibration and Stereo Measurement Methods for Uniform Distance Errors throughout Imaging Ranges","authors":"Nobuhiko Wakai, Takeo Azuma, K. Nobori","doi":"10.23919/MVA51890.2021.9511376","DOIUrl":"https://doi.org/10.23919/MVA51890.2021.9511376","url":null,"abstract":"This paper proposes calibration and stereo measurement methods that enable accurate distance and uniform distribution of the distance error throughout imaging ranges. In stereo measurement using two fisheye cameras, the distance error varies greatly depending on the measurement direction. To reduce the distance error, the proposed method introduces an effectual baseline weight into the stereo measurement using three or more fisheye cameras and their calibration. Accurate distance is obtained because this effectual baseline weight is the optimum weight in the maximum likelihood estimation. Experimental results show that the proposed methods can obtain an accurate distance with a 94% reduction in error and make the distribution of the distance error uniform.","PeriodicalId":312481,"journal":{"name":"2021 17th International Conference on Machine Vision and Applications (MVA)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121073359","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Understanding the Reason for Misclassification by Generating Counterfactual Images","authors":"Muneaki Suzuki, Yoshitaka Kameya, Takuro Kutsuna, N. Mitsumoto","doi":"10.23919/MVA51890.2021.9511352","DOIUrl":"https://doi.org/10.23919/MVA51890.2021.9511352","url":null,"abstract":"Explainable AI (XAI) methods contribute to understanding the behavior of deep neural networks (DNNs), and have attracted interest recently. For example, in image classification tasks, attribution maps have been used to indicate the pixels of an input image that are important to the output decision. Oftentimes, however, it is difficult to understand the reason for misclassification only from a single attribution map. In this paper, in order to enhance the information related to the reason for misclassification, we propose to generate several counterfactual images using generative adversarial networks (GANs). We empirically show that these counterfactual images and their attribution maps improve the interpretability of misclassified images. Furthermore, we additionally propose to generate transitional images by gradually changing the configurations of a GAN in order to understand clearly which part of the misclassified image cause the misclassification.","PeriodicalId":312481,"journal":{"name":"2021 17th International Conference on Machine Vision and Applications (MVA)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122786649","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Seeing Farther Than Supervision: Self-supervised Depth Completion in Challenging Environments","authors":"Seiya Ito, Naoshi Kaneko, K. Sumi","doi":"10.23919/MVA51890.2021.9511354","DOIUrl":"https://doi.org/10.23919/MVA51890.2021.9511354","url":null,"abstract":"This paper tackles the problem of learning a depth completion network from a series of RGB images and short-range depth measurements as a new setting for depth completion. Commodity RGB-D sensors used in indoor environments can provide dense depth measurements; however, their acquisition distance is limited. Recent depth completion methods train CNNs to estimate dense depth maps in a supervised/self-supervised manner while utilizing sparse depth measurements. For self-supervised learning, indoor environments are challenging due to many non-textured regions, leading to the problem of inconsistency. To overcome this problem, we propose a self-supervised depth completion method that utilizes optical flow from two RGB-D images. Because optical flow provides accurate and robust correspondences, the ego-motion can be estimated stably, which can reduce the difficulty of depth completion learning in indoor environments. Experimental results show that the proposed method outperforms the previous self-supervised method in the new depth completion setting and produces qualitatively adequate estimates.","PeriodicalId":312481,"journal":{"name":"2021 17th International Conference on Machine Vision and Applications (MVA)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125921633","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Group Activity Recognition Using Joint Learning of Individual Action Recognition and People Grouping","authors":"Chihiro Nakatani, Kohei Sendo, N. Ukita","doi":"10.23919/MVA51890.2021.9511390","DOIUrl":"https://doi.org/10.23919/MVA51890.2021.9511390","url":null,"abstract":"This paper proposes joint learning of individual action recognition and people grouping for improving group activity recognition. By sharing the information between two similar tasks (i.e., individual action recognition and people grouping) through joint learning, errors of these two tasks are mutually corrected. This joint learning also improves the accuracy of group activity recognition. Our proposed method is designed to consist of any individual action recognition methods as a component. The effectiveness is validated with various IAR methods. By employing existing group activity recognition methods for ensembling with the proposed method, we achieved the best performance compared to the similar SOTA group activity recognition methods.","PeriodicalId":312481,"journal":{"name":"2021 17th International Conference on Machine Vision and Applications (MVA)","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132747185","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Position Estimation of Pedestrians in Surveillance Video Using Face Detection and Simple Camera Calibration","authors":"Toshio Sato, Xin Qi, Keping Yu, Zheng Wen, Yutaka Katsuyama, Takuro Sato","doi":"10.23919/MVA51890.2021.9511348","DOIUrl":"https://doi.org/10.23919/MVA51890.2021.9511348","url":null,"abstract":"Pedestrian position estimation in videos is an important technique for enhancing surveillance system applications. Although many studies estimate pedestrian positions by using human body detection, its usage is limited when the entire body expands outside of the field of view. Camera calibration is also important for realizing accurate position estimation. Most surveillance cameras are not adjusted, and it is necessary to establish a method for easy camera calibration after installation. In this paper, we propose an estimation method for pedestrian positions using face detection and anthropometric properties such as statistical face lengths. We also investigate a simple method for camera calibration that is suitable for actual uses. We evaluate the position estimation accuracy by using indoor surveillance videos.","PeriodicalId":312481,"journal":{"name":"2021 17th International Conference on Machine Vision and Applications (MVA)","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115187614","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HMA-Depth: A New Monocular Depth Estimation Model Using Hierarchical Multi-Scale Attention","authors":"Zhaofeng Niu, Yuichiro Fujimoto, M. Kanbara, H. Kato","doi":"10.23919/MVA51890.2021.9511345","DOIUrl":"https://doi.org/10.23919/MVA51890.2021.9511345","url":null,"abstract":"Monocular depth estimation is an essential technique for tasks like 3D reconstruction. Although many works have emerged in recent years, they can be improved by better utilizing the multi-scale information of the input images, which is proved to be one of the keys in generating high-quality depth estimations. In this paper, we propose a new monocular depth estimation method named HMA-Depth, in which we follow the encoder-decoder scheme and combine several techniques such as skip connections and the atrous spatial pyramid pooling. To obtain more precise local information from the image while keeping a good understanding of the global context, a hierarchical multi-scale attention module is adopted and its outputs are combined to generate the final output that is with both good details and good overall accuracy. Experimental results on two commonly-used datasets prove that HMA-Depth can outperform the existing approaches. Code is available11https://github.com/saranew/HMADepth.","PeriodicalId":312481,"journal":{"name":"2021 17th International Conference on Machine Vision and Applications (MVA)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115196631","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Temporal Extension for Encoder-Decoder-based Crowd Counting Approaches","authors":"T. Golda, F. Krüger, J. Beyerer","doi":"10.23919/MVA51890.2021.9511351","DOIUrl":"https://doi.org/10.23919/MVA51890.2021.9511351","url":null,"abstract":"Crowd counting is an important aspect to safety monitoring at mass events and can be used to initiate safety measures in time. State-of-the-art encoder-decoder architectures are able to estimate the number of people in a scene precisely. However, since most of the proposed methods are based to solely operate on single-image features, we observe that estimated counts for aerial video sequences are inherently noisy, which in turn reduces the significance of the overall estimates. In this paper, we propose a simple temporal extension to said encoder-decoder architectures that incorporates local context from multiple frames into the estimation process. By applying the temporal extension a state-of-the-art architectures and exploring multiple configuration settings, we find that the resulting estimates are more precise and smoother over time.","PeriodicalId":312481,"journal":{"name":"2021 17th International Conference on Machine Vision and Applications (MVA)","volume":"147 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115690810","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ROT-Harris: A Dynamic Approach to Asynchronous Interest Point Detection","authors":"S. Harrigan, S. Coleman, M. Ker, P. Yogarajah, Z. Fang, Chengdong Wu","doi":"10.23919/MVA51890.2021.9511407","DOIUrl":"https://doi.org/10.23919/MVA51890.2021.9511407","url":null,"abstract":"Event-based vision sensors are a paradigm shift in the way that visual information is obtained and processed. These devices are capable of low-latency transmission of data which represents the scene dynamics. Additionally, low-power benefits make the sensors popular in finite-power scenarios such as high-speed robotics or machine vision applications where latency in visual information is desired to be minimal. The core datatype of such vision sensors is the ‘event’ which is an asynchronous per-pixel signal indicating a change in light intensity at an instance in time corresponding to the spatial location of that sensor on the array. A popular approach to event-based processing is to map events onto a 2D plane over time which is comparable with traditional imaging techniques. However, this paper presents a disruptive approach to event data processing that uses a tree-based filter framework that directly processes raw event data to extract events corresponding to interest point features, which is then combined with a Harris interest point approach to isolate features. We hypothesise that since the tree structure contains the same spatial information as a 2D surface mapping, Harris may be applied directly to the content of the tree, bypassing the need for transformation to the 2D plane. Results illustrate that the proposed approach performs better than other state-of-the-art approaches with limited compromise on the run-time performance.","PeriodicalId":312481,"journal":{"name":"2021 17th International Conference on Machine Vision and Applications (MVA)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127265674","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Distant Bird Detection for Safe Drone Flight and Its Dataset","authors":"Sanae Fujii, Kazutoshi Akita, N. Ukita","doi":"10.23919/MVA51890.2021.9511386","DOIUrl":"https://doi.org/10.23919/MVA51890.2021.9511386","url":null,"abstract":"For the safe flight of drones, they must avoid the attacks of aggressive birds. These birds move very fast and must be detected far enough away. In recent years, deep learning has made it possible to detect small distant objects in RGB camera images. Since these methods are learning-based, they require a large amount of training images, but there are no publicly-available datasets for bird detection taken from drones. In this work, we propose a new dataset captured by a drone camera. Our dataset consists of 34,467 bird instances in 21,837 images that were captured in various locations and conditions. Our experimental results show that, even with the SOTA detection model, our dataset is sufficiently challenging. We also demonstrated that (1) several standard techniques for improving detection methods (e.g., data augmentation) are inappropriate for our challenging dataset, and (2) carefully-selected techniques can improve the detection performance.","PeriodicalId":312481,"journal":{"name":"2021 17th International Conference on Machine Vision and Applications (MVA)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125359655","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Weakly Supervised Domain Adaptation using Super-pixel labeling for Semantic Segmentation","authors":"Masaki Yamazaki, Xingchao Peng, Kuniaki Saito, Ping Hu, Kate Saenko, Y. Taniguchi","doi":"10.23919/MVA51890.2021.9511365","DOIUrl":"https://doi.org/10.23919/MVA51890.2021.9511365","url":null,"abstract":"Deep learning for semantic segmentation requires a large amount of labeled data, but manually annotating images are very expensive and time consuming. To overcome the limitation, unsupervised domain adaptation methods adapt a segmentation model trained on a labeled source domain (synthetic data) to an unlabeled target domain (real-world scenes). However, the unsupervised methods have a poor performance than the supervised methods with target domain labels. In this paper, we propose a novel weakly supervised domain adaptation using super-pixel labeling for semantic segmentation. The proposed method reduces annotation cost by estimating a suitable labeling area calculated from the Entropy-based cost of a previously learned segmentation model. In addition, we generate the new pseudo-labels by applying fully connected Conditional Random Field model over the pseudo-labels obtained using an unsupervised domain adaptation. We show that our proposed method is a powerful approach for reducing annotation cost.","PeriodicalId":312481,"journal":{"name":"2021 17th International Conference on Machine Vision and Applications (MVA)","volume":"27 1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116709710","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}