{"title":"Self-Supervised Deep Fisheye Image Rectification Approach using Coordinate Relations","authors":"Masaki Hosono, E. Simo-Serra, Tomonari Sonoda","doi":"10.23919/MVA51890.2021.9511349","DOIUrl":"https://doi.org/10.23919/MVA51890.2021.9511349","url":null,"abstract":"With the ascent of wearable camera, dashcam, and autonomous vehicle technology, fisheye lens cameras are becoming more widespread. Unlike regular cameras, the videos and images taken with fisheye lens suffer from significant lens distortion, thus having detrimental effects on image processing algorithms. When the camera parameters are known, it is straight-forward to correct the distortion, however, without known camera parameters, distortion correction becomes a non-trivial task. While learning-based approaches exist, they rely on complex datasets and have limited generalization. In this work, we propose a CNN-based approach that can be trained with readily available data. We exploit the fact that relationships between pixel coordinates remain stable after homogeneous distortions to design an efficient rectification model. Experiments performed on the cityscapes dataset show the effectiveness of our approach. Our code is available at GitHub11https://github.com/MasakHosono/SelfSupervisedFisheyeRectification.","PeriodicalId":312481,"journal":{"name":"2021 17th International Conference on Machine Vision and Applications (MVA)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128103181","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A baseline for semi-supervised learning of efficient semantic segmentation models","authors":"I. Grubisic, Marin Orsic, Sinisa Segvic","doi":"10.23919/MVA51890.2021.9511402","DOIUrl":"https://doi.org/10.23919/MVA51890.2021.9511402","url":null,"abstract":"Semi-supervised learning is especially interesting in the dense prediction context due to high cost of pixel-level ground truth. Unfortunately, most such approaches are evaluated on outdated architectures which hamper research due to very slow training and high requirements on GPU RAM. We address this concern by presenting a simple and effective baseline which works very well both on standard and efficient architectures. Our baseline is based on one-way consistency and nonlinear geometric and photometric perturbations. We show advantage of perturbing only the student branch and present a plausible explanation of such behaviour. Experiments on Cityscapes and CIFAR-10 demonstrate competitive performance with respect to prior work.","PeriodicalId":312481,"journal":{"name":"2021 17th International Conference on Machine Vision and Applications (MVA)","volume":"521 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115351106","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Occlusion-Robust 3D Hand Pose Estimation from a Single RGB Image","authors":"Asuka Ishii, Gaku Nakano, Tetsuo Inoshita","doi":"10.23919/MVA51890.2021.9511389","DOIUrl":"https://doi.org/10.23919/MVA51890.2021.9511389","url":null,"abstract":"We propose an occlusion-robust network for 3D hand pose estimation from a single RGB image. Severe occlusions degrade the estimation accuracy of not only occluded keypoints but also visible keypoints. Since the existing methods based on a deep neural network perform convolutions on all keypoints regardless of visibility, inaccurate features from occluded keypoints affect the localization of visible keypoints. To suppress the influence of occluded keypoints, our proposed deep neural network consists of three modules: a 2D heatmap generator, parallel sub-joints network (PSJNet), and an ensemble network (EN). First, the 2D position of all keypoints in an input image is predicted as a 2D heatmap, similar to the existing methods. Then, PSJNet, which consists of several graph convolutional networks (GCN) in parallel, estimates multiple incomplete 3D poses in which some of the keypoints have been removed. Each GCN performs convolutions on a limited number of keypoints, therefore, features from occluded keypoints do not spread to the whole pose. Finally, EN merges the incomplete poses into a single 3D pose by selecting accurate positions from them. Experimental results on a public dataset RHD demonstrate that the proposed method outperforms the existing methods in the case of both small and severe occlusions.","PeriodicalId":312481,"journal":{"name":"2021 17th International Conference on Machine Vision and Applications (MVA)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127517361","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Image Information Assistance Neural Network for VideoPose3D-based Monocular 3D Pose Estimation","authors":"Hao Wang, Dingli Luo, T. Ikenaga","doi":"10.23919/MVA51890.2021.9511380","DOIUrl":"https://doi.org/10.23919/MVA51890.2021.9511380","url":null,"abstract":"3D pose estimation based on a monocular camera can be applied to various fields such as human-computer interaction and human action recognition. As a two-stage 3D pose estimator, VideoPose3D achieves state-of-the-art accuracy. However, because of the limitation of two-stage processing, image information is partially lost in the process of mapping 2D poses to 3D space, which results in limited final accuracy. This paper proposes an image-assisting pose estimation model and a back-projection based offset generating module. The image-assisting pose estimation model consists of a 2D pose processing branch and an image processing branch. Image information is processed to generate an offset to refine the intermediate 3D pose produced by the 2D pose processing network. The back-projection based offset generating module projects the intermediate 3D poses to 2D space and calculates the error between the projection and input 2D pose. With the error combining with extracted image feature, the neural network generates an offset to decrease the error. By evaluation, the accuracy on each action of Human3.6M dataset gets an average improvement of 0.9 mm over the VideoPose3D baseline.","PeriodicalId":312481,"journal":{"name":"2021 17th International Conference on Machine Vision and Applications (MVA)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124450016","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On the Influence of Viewpoint Change for Metric Learning","authors":"Marco Filax, F. Ortmeier","doi":"10.23919/MVA51890.2021.9511344","DOIUrl":"https://doi.org/10.23919/MVA51890.2021.9511344","url":null,"abstract":"Physical objects imaged through a camera change their visual representation based on various factors, c.g., illumination, occlusion, or viewpoint changes. Thus, it is the inevitable goal in computer vision systems to use mathematical representations of these objects robust to various changes and yet sufficient to determine even minor differences to distinguish objects. However, finding these powerful representations is challenging if the amount of data is limited, such as in few-shot learning problems. In this work, we investigate the influence of viewpoint changes in modern recognition systems in the context of metric learning problems, in which fine-grained differences differentiate objects based on their learned numeric representation. Our results demonstrate that restricting the degrees of freedom, especially by fixing the virtual viewpoint using synthetic frontal views, elevates the overall performance. We await that our observation of an increased performance using rectified patches is persistent and reproducible in other scenarios.","PeriodicalId":312481,"journal":{"name":"2021 17th International Conference on Machine Vision and Applications (MVA)","volume":"287 2","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"113996108","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Encoding-free Incrementing Hough Transform for High Frame Rate and Ultra-low Delay Straight-line Detection","authors":"Ziwei Dong, Tingting Hu, Ryuji Fuchikami, T. Ikenaga","doi":"10.23919/MVA51890.2021.9511359","DOIUrl":"https://doi.org/10.23919/MVA51890.2021.9511359","url":null,"abstract":"High frame rate and ultra-low delay straight-line detection plays an increasingly important role in highly automated factories that call for straight-line features to achieve swift locations in real scenes. However, vision systems based on CPU/GPU have a fixed delay between image capture and detection, making straight-line detection challenging to reach an ultra-low delay. Achieving detection nearly simultaneous with capture on the same image is considered. This paper proposes (A) an encoding-free incrementing Hough transform and (B) a partially compressed line parameter space to implement a straight-line detection core on an FPGA board. The encoding-free incrementing Hough transform directly calculates line parameters only by incrementing and initialization while capturing an image. Furthermore, the partially compressed line parameter space reduces the required memory resources and the path delay under the premise of accurate vote recordings for every line feature. The evaluation result shows that the proposals achieve as accurate detection (RMSE of θ on 0.0057, and RMSE of p on 2.09) as standard Hough transform (RMSE of θ on 0.0057, and RMSE of p on 2.13) and the designed detection core processes VGA (640 × 480) videos at 1.358 ms/frame delay.","PeriodicalId":312481,"journal":{"name":"2021 17th International Conference on Machine Vision and Applications (MVA)","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131288360","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Critically Compressed Quantized Convolution Neural Network based High Frame Rate and Ultra-Low Delay Fruit External Defects Detection","authors":"Jihang Zhang, Dongmei Huang, Tingting Hu, Ryuji Fuchikami, T. Ikenaga","doi":"10.23919/MVA51890.2021.9511388","DOIUrl":"https://doi.org/10.23919/MVA51890.2021.9511388","url":null,"abstract":"High frame rate and ultra-low delay fruit external defects detection plays a key role in high-efficiency and high-quality oriented fruit products manufacture. However, current traditional computer vision based commercial solutions still lack capability of detecting most types of deceptive external defects. Although recent researches have discovered deep learning 's great potential towards defects detection, solutions with large general CNNs are too slow to adapt to high-speed factory pipelines. This paper proposes a critically compressed separable convolution network, and bit depth degression quantization to further transform the network for FPGA acceleration, which makes the implementation of CNN on High Frame Rate and Ultra-Low Delay Vision System possible. With minimal searched specialized structure, the critically compressed separable convolution network is able to handle external quality classification task with a minuscule number of parameters. By assigning degressive bit depth to different layers according to degressive bit depth importance, the customized quantization is able to compress our network more efficiently than traditional method. The proposed network consists 0.1% weight size of MobileNet (alpha = 0.25), while only a 1.54% drop of overall accuracy on validation set is observed. The hardware estimation shows the network classification unit is able to work at 0.672 ms delay with the resolution of 100*100 and up to 6 classification units parallelly.","PeriodicalId":312481,"journal":{"name":"2021 17th International Conference on Machine Vision and Applications (MVA)","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130940090","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Practical Descattering of Transmissive Inspection Using Slanted Linear Image Sensors","authors":"Takahiro Kushida, Kenichiro Tanaka, Takuya Funatomi, K. Tahara, Y. Kagawa, Y. Mukaigawa","doi":"10.23919/MVA51890.2021.9511372","DOIUrl":"https://doi.org/10.23919/MVA51890.2021.9511372","url":null,"abstract":"This paper presents an industry-ready descattering method that is easily applied to a food production line. The system consists of multiple sets comprising a linear image sensor and linear light source, which are slanted at different angles. The images captured by these sensors, which are partially clear along the perpendicular direction to the sensor, are computationally integrated into a single clear image over the frequency domain. We assess the effectiveness of the proposed method by simulation and by our prototype system, which demonstrates the feasibility of the proposed method on an actual production line.","PeriodicalId":312481,"journal":{"name":"2021 17th International Conference on Machine Vision and Applications (MVA)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115331638","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Live Video Action Recognition from Unsupervised Action Proposals","authors":"Roberto J. Lópcz-Sastrc, Marcos Baptista-Ríos, F. J. Acevedo-Rodríguez, P. Martín-Martín, S. Maldonado-Bascón","doi":"10.23919/MVA51890.2021.9511355","DOIUrl":"https://doi.org/10.23919/MVA51890.2021.9511355","url":null,"abstract":"The problem of action detection in untrimmed videos consists in localizing those parts of a certain video that can contain an action. Typically, state-of-the-art approaches to this problem use a temporal action proposals (TAPs) generator followed by an action classifier module. Moreover, TAPs solutions are learned from a supervised setting, and need the entire video to be processed to produce effective proposals. These properties become a limitation for certain real applications in which a system requires to know the content of the video in an online fashion. To do so, in this work we introduce a live video action detection application which integrates the action classifier step with an unsupervised and online TAPs generator. We evaluate, for the first time, the precision of this novel pipeline for the problem of action detection in untrimmed videos. We offer a thorough experimental evaluation in Activi-tyNet dataset, where our unsupervised model can compete with the state-of-the-art supervised solutions.","PeriodicalId":312481,"journal":{"name":"2021 17th International Conference on Machine Vision and Applications (MVA)","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124985649","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Action Spotting and Temporal Attention Analysis in Soccer Videos","authors":"H. Minoura, Tsubasa Hirakawa, Takayoshi Yamashita, H. Fujiyoshi, Mitsuru Nakazawa, Yeongnam Chae, B. Stenger","doi":"10.23919/MVA51890.2021.9511342","DOIUrl":"https://doi.org/10.23919/MVA51890.2021.9511342","url":null,"abstract":"Action spotting is the task of finding a specific action in a video. In this paper, we consider the task of spotting actions in soccer videos, e.g., goals, player substitutions, and card scenes, which are temporally sparse within a complete game. We spot actions using a Transformer model, which allows capturing important features before and after action scenes. Moreover, we analyze which time instances the model focuses on when predicting an action by observing the internal weights of the transformer. Quantitative results on the public SoccerNet dataset show that the proposed method achieves an mAP of 81.6%, a significant improvement over previous methods. In addition, by analyzing the attention weights, we discover that the model focuses on different temporal neighborhoods for different actions.","PeriodicalId":312481,"journal":{"name":"2021 17th International Conference on Machine Vision and Applications (MVA)","volume":"67 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121948415","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}