Proceedings of the 1st Workshop and Challenge on Comprehensive Video Understanding in the Wild: Latest Articles


Explore Multi-Step Reasoning in Video Question Answering

Proceedings of the 1st Workshop and Challenge on Comprehensive Video Understanding in the Wild. Pub Date: 2018-10-15. DOI: 10.1145/3265987.3265996

Yahong Han

Abstract: This invited talk revisits, in more detail, a paper accepted at ACM-MM 2018. Video question answering (VideoQA) always involves visual reasoning. When answering questions composed of multiple logical correlations, models need to perform multi-step reasoning. In this paper, we formulate multi-step reasoning in VideoQA as a new task: answering compositional, logically structured questions based on video content. Existing VideoQA datasets are inadequate as benchmarks for multi-step reasoning because they lack logical structure and contain language biases. We therefore design a system that automatically generates a large-scale dataset, SVQA (Synthetic Video Question Answering). Compared with other VideoQA datasets, SVQA contains exclusively long, structured questions with various spatial and temporal relations between objects. More importantly, questions in SVQA can be decomposed into human-readable logical tree or chain layouts, each node of which represents a sub-task requiring a reasoning operation such as comparison or arithmetic. To answer questions in SVQA automatically, we develop a new VideoQA model. In particular, we construct a new attention module that contains a spatial attention mechanism to address the crucial, multiple logical sub-tasks embedded in questions, together with a refined GRU called ta-GRU (temporal-attention GRU) to capture long-term temporal dependencies and gather complete visual cues. Experimental results demonstrate the multi-step reasoning capability of SVQA and the effectiveness of our model compared with existing models.
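The idea of decomposing a question into a logical tree, where each node is one reasoning operation, can be illustrated with a small evaluator. This is only a sketch under stated assumptions: the `Node` class, the operation names (`scene`, `filter`, `count`, `greater`), the object attributes, and the example question are all hypothetical and are not taken from SVQA or from the paper's model.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    """One reasoning step in a question layout (hypothetical structure)."""
    op: str
    args: tuple = ()
    children: tuple = ()


def evaluate(node: Node, scene: List[dict]):
    """Recursively evaluate a layout tree against a list of scene objects."""
    # Evaluate sub-tasks first; their results feed this node's operation.
    inputs = [evaluate(child, scene) for child in node.children]
    if node.op == "scene":
        return list(scene)
    if node.op == "filter":
        key, value = node.args
        return [obj for obj in inputs[0] if obj.get(key) == value]
    if node.op == "count":  # arithmetic-style step
        return len(inputs[0])
    if node.op == "greater":  # comparison step
        return inputs[0] > inputs[1]
    raise ValueError(f"unknown operation: {node.op}")


def count_of(shape: str, action: str) -> Node:
    """Build the chain scene -> filter(shape) -> filter(action) -> count."""
    by_shape = Node("filter", ("shape", shape), (Node("scene"),))
    by_action = Node("filter", ("action", action), (by_shape,))
    return Node("count", (), (by_action,))


# Made-up scene: one dict per object observed in the video.
scene = [
    {"shape": "cube", "action": "rotate"},
    {"shape": "cube", "action": "rotate"},
    {"shape": "sphere", "action": "move"},
    {"shape": "cube", "action": "move"},
]

# Hypothetical question: "Are there more rotating cubes than moving spheres?"
question = Node("greater", (), (count_of("cube", "rotate"),
                                count_of("sphere", "move")))
answer = evaluate(question, scene)
```

Here the two `count_of` branches are chains, and the `greater` node joins them into a tree, so answering the question requires several dependent steps rather than a single lookup.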

Citations: 44

Multi-task Joint Learning for Videos in the Wild

Proceedings of the 1st Workshop and Challenge on Comprehensive Video Understanding in the Wild. Pub Date: 2018-10-15. DOI: 10.1145/3265987.3265988

Yongwon Hong, Hoseong Kim, H. Byun

Abstract: Most conventional state-of-the-art methods for video analysis achieve outstanding performance by combining two or more different inputs, e.g., an RGB image, a motion image, or an audio signal, in a two-stream manner. Although these approaches perform well, they treat every feature as equally important for classifying a video. This obscures the nature of each class, namely that every class depends on different levels of information from different features. To account for the nature of each class, we present class-nature-specific fusion, which combines the features with different weights to obtain the best result for each class. In this work, we first represent each frame-level video feature as a spectral image to train convolutional neural networks (CNNs) on the RGB and audio features. We then revise the conventional two-stream fusion method into a class-nature-specific one by combining features with different weights for different classes. We evaluate our method on the Comprehensive Video Understanding in the Wild dataset to understand how each class responds to each feature in wild videos. Our experimental results not only show the advantage over conventional two-stream fusion but also illustrate the correlation between the two features, RGB and audio signal, for each class.
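The contrast between conventional two-stream fusion and class-specific fusion can be sketched as a per-class weighted combination of stream scores. This is a minimal illustration, not the authors' implementation: the function names, the score values, and the per-class weights below are hypothetical, and the abstract does not say how the weights are obtained.

```python
from typing import List


def uniform_fusion(rgb: List[float], audio: List[float]) -> List[float]:
    """Conventional late fusion: every class averages the two streams equally."""
    return [0.5 * r + 0.5 * a for r, a in zip(rgb, audio)]


def class_specific_fusion(rgb: List[float], audio: List[float],
                          rgb_weight: List[float]) -> List[float]:
    """Fuse per-class scores with one RGB weight per class.

    rgb_weight[c] lies in [0, 1]; the audio stream receives 1 - rgb_weight[c].
    """
    if not (len(rgb) == len(audio) == len(rgb_weight)):
        raise ValueError("all inputs must have one entry per class")
    return [w * r + (1.0 - w) * a
            for r, a, w in zip(rgb, audio, rgb_weight)]


# Made-up per-class scores from the two streams for a single video.
rgb_scores = [0.9, 0.2, 0.4]
audio_scores = [0.1, 0.8, 0.6]
# Hypothetical weights: class 1 leans on audio, class 2 relies on RGB only.
weights = [0.5, 0.25, 1.0]

uniform = uniform_fusion(rgb_scores, audio_scores)
fused = class_specific_fusion(rgb_scores, audio_scores, weights)
predicted = max(range(len(fused)), key=fused.__getitem__)
```

With uniform weights the three classes are indistinguishable in this example, while the class-specific weights separate them; in practice the weights would have to be estimated from validation data or learned jointly, which this sketch does not attempt.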

Citations: 2
