{"title":"Bi-directional Recurrent MVSNet for High-resolution Multi-view Stereo","authors":"Taku Fujitomi, Seiya Ito, Naoshi Kaneko, K. Sumi","doi":"10.23919/MVA51890.2021.9511358","DOIUrl":"https://doi.org/10.23919/MVA51890.2021.9511358","url":null,"abstract":"Learning-based multi-view stereo regularizes cost volumes containing spatial information to reduce noise and improve the quality of a depth map. Cost volume regularization using 3D CNNs consumes a large amount of memory, making it difficult to scale up the network architecture. Recent work proposed a cost-volume regularization method that applies 2D convolutional GRUs and significantly reduces memory consumption. However, this uni-directional recurrent processing has a narrower receptive field than 3D CNNs because the regularized cost at a time step does not contain information about future time steps. In this paper, we propose a cost volume regularization method using bi-directional GRUs that expands the receptive field in the depth direction. In our experiments, our proposed method significantly outperforms the conventional methods in several benchmarks while maintaining low memory consumption.","PeriodicalId":312481,"journal":{"name":"2021 17th International Conference on Machine Vision and Applications (MVA)","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121050218","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Contextual Information based Network with High-Frequency Feature Fusion for High Frame Rate and Ultra-Low Delay Small-Scale Object Detection","authors":"Dongmei Huang, Jihang Zhang, Tingting Hu, Ryuji Fuchikami, T. Ikenaga","doi":"10.23919/MVA51890.2021.9511387","DOIUrl":"https://doi.org/10.23919/MVA51890.2021.9511387","url":null,"abstract":"High frame rate and ultra-low delay small-scale object detection plays an important role in factory automation for its timely and accurate reaction. Although many CNN based detection methods have been proposed to improve the accuracy of small object detection for the low resolution and large gap between the object and the background, it is difficult to achieve a trade-off between accuracy and speed. For the pursuit of ultra-low delay processing by utilizing FPGA, this paper proposes: (A) IoU and distance based loss function, (B) Contextual information with high temporal correlation based parallel detection, (C) High frequency feature fusion for enhancing low-bit networks. The proposed methods achieve 45.3 % mAP for test sequences, which is only 0.7 % mAP lower compared with the general method. Meanwhile, the size of the model has been compressed to 1.94 % of the original size and reaches a speed of 278 fPs on FPGA and 15 fPs on GPU.","PeriodicalId":312481,"journal":{"name":"2021 17th International Conference on Machine Vision and Applications (MVA)","volume":"17 1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126034196","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Selecting an Iconic Pose From an Action Video","authors":"Geethu Miriam Jacob, B. Stenger","doi":"10.23919/MVA51890.2021.9511347","DOIUrl":"https://doi.org/10.23919/MVA51890.2021.9511347","url":null,"abstract":"This paper presents a method for selecting an iconic pose frame from an action video. An iconic pose frame is a frame showing a representative pose, distinct from other actions. We first extract a diverse set of keyframes from the video using unsupervised video summarization. A classification loss ensures that the selected frames retain high action classification accuracy. To find iconic poses, we introduce two loss terms, an Extreme Pose Loss, encouraging selecting poses far from the mean pose, and a Frame Contrastive Loss, which encourages poses from the same action to be similar. In a user preference study on UCF-101 videos we show that the automatically selected iconic pose keyframes are preferred to manually selected ones in 48% of cases.","PeriodicalId":312481,"journal":{"name":"2021 17th International Conference on Machine Vision and Applications (MVA)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123566885","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Attention Mining Branch for Optimizing Attention Map","authors":"Takaaki Iwayoshi, Masahiro Mitsuhara, Masayuki Takada, Tsubasa Hirakawa, Takayoshi Yamashita, H. Fujiyoshi","doi":"10.23919/MVA51890.2021.9511357","DOIUrl":"https://doi.org/10.23919/MVA51890.2021.9511357","url":null,"abstract":"Attention branch networks (ABNs) can achieve high accuracy by visualizing the attention area of the network during inference and utilizing it in the recognition process. However, if the attention area does not highlight the target object to be recognized, it may cause recognition failure. While there is a method for fine-tuning the ABN using attention maps modified by human knowledge, it takes up a lot of labor and time because the attention map needs to be modified manually. In this paper, we propose a method that automatically optimizes the attention map by introducing an attention mining branch to the ABN. Our evaluation experiments show that the proposed method improves the recognition accuracy and obtains an attention map that appropriately focuses on the target object to be recognized.","PeriodicalId":312481,"journal":{"name":"2021 17th International Conference on Machine Vision and Applications (MVA)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125555759","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Saliency based Subject Selection for Diverse Image Captioning","authors":"Quoc-An Luong, Duc Minh Vo, A. Sugimoto","doi":"10.23919/MVA51890.2021.9511360","DOIUrl":"https://doi.org/10.23919/MVA51890.2021.9511360","url":null,"abstract":"Image captioning has drawn more and more attention because of its practical usefulness in many multimedia applications. Multiple criteria such as accuracy, detail or diversity exist to evaluate the quality of generated captions. Among them, diversity is the most difficult because for a given image, its multiple captions should be generated while retaining their accuracy. We approach to diverse image captioning by explicitly selecting objects in an image one by one as a subject in generating captions. Our method has three main steps: (1) After generating scene graph of a given image, we first give selection priority to the nodes (namely, subjects) in the scene graph based on the size and visual saliency of objects. (2) With a selected subject, we prune a portion of the scene graph structure that is irrelevant to the subject to have subject-oriented scene graph for accurate captioning. (3) We convert the subject-oriented scene graph into its more sentence-friendly abstract meaning representation (AMR) to generate the caption whose the subject is the selected root. In this way, we can generate captions whose subjects are different from each other, achieving diversity. Our proposed method achieves comparable results with other methods in both diversity and accuracy.","PeriodicalId":312481,"journal":{"name":"2021 17th International Conference on Machine Vision and Applications (MVA)","volume":"5 3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129417177","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Facial landmark detection transfer learning for a specific user in driver status monitoring systems","authors":"Jaechul Kim, K. Taguchi, Yusuke Hayashi, Jungo Miyazaki, H. Fujiyoshi","doi":"10.23919/MVA51890.2021.9511385","DOIUrl":"https://doi.org/10.23919/MVA51890.2021.9511385","url":null,"abstract":"The wide variety of human faces make it nearly impossible to prepare a complete training data set for facial landmark detection. Because of this, the performance of facial landmark detection is unlikely to be sufficient for driver status monitoring (DSM) systems. To improve the performance for a specific person (SP) by collecting data about that person, we propose the generator and discriminator model using the Lucas-Kanade assistance (GDA) algorithm for compiling a training data set. Even when data for a specific user can be collected, another issue is how to efficiently, effectively, and quickly re-train the model using an insufficient data set. To address this problem, we propose a novel method of transfer learning in the context of composite backbone networks (GBNet). The assistant backbone of GBNet is trained on a large unspecified people (USP) data set in the source domain and transfers its representation to the lead backbone, which is trained by a small SP data set in the target domain. In addition, we design an assistance loss function with output that is not only close to the SP data set, but also consistent with a USP data set with respect to labeled images. We test the proposed method using the 300 Videos in the Wild (300VW) data set and our own data set. Furthermore, show that the proposed method improves the stability of predictions. We expect our method to contribute to the realization of stable DSM systems.","PeriodicalId":312481,"journal":{"name":"2021 17th International Conference on Machine Vision and Applications (MVA)","volume":"103 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126663479","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"AVM Image Quality Enhancement by Synthetic Image Learning for Supervised Deblurring","authors":"Kazutoshi Akita, Masayoshi Hayama, Haruya Kyutoku, N. Ukita","doi":"10.23919/MVA51890.2021.9511398","DOIUrl":"https://doi.org/10.23919/MVA51890.2021.9511398","url":null,"abstract":"An Around View Monitoring (AVM) system is widely used to allow a driver to watch the situation around a car. The AVM image is generated by image distortion correction and viewpoint transformation for images captured by wide view-angle cameras installed on the car. However, the AVM image is blurred due to these transformations. This blur impairs the visibility of the driver. While many deblurring methods based on CNN have been proposed, these general-purpose de-blurring methods are not designed for the AVM image. (1) Since the blur level in the AVM image is region-dependent, deblurring for the AVM should also be region-dependent. (2) Furthermore, while supervised deblurring methods require a pair of input-blurred and output-deblurred images, it is not easy to collect the deblurred AVM image. This paper proposes a method for generating the pairs of training images that cope with the aforementioned two problems. These training images are generated by the inverse transformation of the AVM image generation process. Experimental results show that our method can suppress blur on AVM images. We also confirmed that even a very shallow CNN with the inference time of 2.1ms has the same performance as the SoTA model.","PeriodicalId":312481,"journal":{"name":"2021 17th International Conference on Machine Vision and Applications (MVA)","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125652588","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Japanese Sentence Dataset for Lip- reading","authors":"Tatsuya Shirakata, T. Saitoh","doi":"10.23919/MVA51890.2021.9511353","DOIUrl":"https://doi.org/10.23919/MVA51890.2021.9511353","url":null,"abstract":"This research is about lip-reading for Japanese sentences. Research on English sentences is actively pursued due to the extensive datasets. However, a sufficient dataset for Japanese sentences has not been released. Therefore, this paper builds a Japanese sentence dataset. A Transformer model is used for the recognition task. Three recognition target levels: phoneme, mora, and vowel, are set, and recognition experiments show that they can be recognized.","PeriodicalId":312481,"journal":{"name":"2021 17th International Conference on Machine Vision and Applications (MVA)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131674746","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-physical and Temporal Feature Based Self-correcting Approximation Model for Monocular 3D Volleyball Trajectory Analysis","authors":"J. Dong, Xina Cheng, T. Ikenaga","doi":"10.23919/MVA51890.2021.9511408","DOIUrl":"https://doi.org/10.23919/MVA51890.2021.9511408","url":null,"abstract":"Benefiting from the low venue requirements and deployment cost, analysis of 3D volleyball trajectory from monocular vision sensor is of important significance to volleyball game analysis and training assisting. Because of the monocular vision limitation, complicated ball trajectory caused by physical factors and model drifting owing to distance information loss are two governing challenges. This paper proposes a multi-physical factors and self-cor-recting trajectory approximation model. Also, a trajectory correction algorithm based on temporal motion features is proposed. For the first challenge, air resistance factor and gravity factor which mostly impact volleyball during flying are considered to simulate ball motion status. The approximation model parameters are evaluated and corrected during model calculating to reduce calculation error. To limiting model drifting, volleyball movement characteristics based on temporal motion feature is applied to correct approximated trajectory. The success rate of proposed monocular 3D trajectory approximation method achieves 82.5% which has 47.0% improvement comparing with conventional work.","PeriodicalId":312481,"journal":{"name":"2021 17th International Conference on Machine Vision and Applications (MVA)","volume":"52 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133851754","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Proceedings of MVA 2021 17th International Conference on Machine Vision Applications","authors":"","doi":"10.23919/mva51890.2021.9511373","DOIUrl":"https://doi.org/10.23919/mva51890.2021.9511373","url":null,"abstract":"","PeriodicalId":312481,"journal":{"name":"2021 17th International Conference on Machine Vision and Applications (MVA)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124075254","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}