How to Make a BLT Sandwich? Learning VQA towards Understanding Web Instructional Videos

Shaojie Wang, Wentian Zhao, Ziyi Kou, Jing Shi, Chenliang Xu
{"title":"How to Make a BLT Sandwich? Learning VQA towards Understanding Web Instructional Videos","authors":"Shaojie Wang, Wentian Zhao, Ziyi Kou, Jing Shi, Chenliang Xu","doi":"10.1109/WACV48630.2021.00117","DOIUrl":null,"url":null,"abstract":"Understanding web instructional videos is an essential branch of video understanding in two aspects. First, most existing video methods focus on short-term actions for a-few-second-long video clips; these methods are not directly applicable to long videos. Second, unlike unconstrained long videos, e.g., movies, instructional videos are more structured in that they have step-by-step procedures constraining the understanding task. In this work, we study problem-solving on instructional videos via Visual Question Answering (VQA). Surprisingly, it has not been an emphasis for the video community despite its rich applications. We thereby introduce YouCookQA, an annotated QA dataset for instructional videos based on YouCook2 [27]. The questions in YouCookQA are not limited to cues on a single frame but relations among multiple frames in the temporal dimension. Observing the lack of effective representations for modeling long videos, we propose a set of carefully designed models including a Recurrent Graph Convolutional Network (RGCN) that captures both temporal order and relational information. Furthermore, we study multiple modalities including descriptions and transcripts for the purpose of boosting video understanding. Extensive experiments on YouCookQA suggest that RGCN performs the best in terms of QA accuracy and better performance is gained by introducing human-annotated descriptions. YouCookQA dataset is available at https://github.com/Jossome/YoucookQA.","PeriodicalId":236300,"journal":{"name":"2021 IEEE Winter Conference on Applications of Computer Vision (WACV)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 IEEE Winter Conference on Applications of Computer Vision (WACV)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/WACV48630.2021.00117","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2

Abstract

Understanding web instructional videos is an essential branch of video understanding in two respects. First, most existing video methods focus on short-term actions in video clips a few seconds long; these methods are not directly applicable to long videos. Second, unlike unconstrained long videos, e.g., movies, instructional videos are more structured in that their step-by-step procedures constrain the understanding task. In this work, we study problem-solving on instructional videos via Visual Question Answering (VQA). Surprisingly, despite its rich applications, this task has not been an emphasis for the video community. We therefore introduce YouCookQA, an annotated QA dataset for instructional videos based on YouCook2 [27]. The questions in YouCookQA are not limited to cues on a single frame but involve relations among multiple frames in the temporal dimension. Observing the lack of effective representations for modeling long videos, we propose a set of carefully designed models, including a Recurrent Graph Convolutional Network (RGCN) that captures both temporal order and relational information. Furthermore, we study multiple modalities, including descriptions and transcripts, to boost video understanding. Extensive experiments on YouCookQA suggest that RGCN performs best in terms of QA accuracy, and that introducing human-annotated descriptions yields further gains. The YouCookQA dataset is available at https://github.com/Jossome/YoucookQA.
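The abstract names the RGCN's two ingredients, graph convolution for relational information across clips and recurrence for temporal order, but does not spell out the wiring. The following is a minimal PyTorch sketch of one plausible composition: a single graph-convolution hop over clip-level nodes followed by a GRU scan over the nodes in temporal order. All names, dimensions, the chain-structured adjacency, and the GCN-then-GRU ordering are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class RecurrentGCNSketch(nn.Module):
    """Hypothetical recurrent graph convolution step (not the paper's code):
    a shared linear projection plus one adjacency hop mixes relational
    information across clip nodes, then a GRU aggregates the resulting
    node features in temporal order."""

    def __init__(self, in_dim: int, hid_dim: int):
        super().__init__()
        self.proj = nn.Linear(in_dim, hid_dim)           # shared GCN weight W
        self.gru = nn.GRU(hid_dim, hid_dim, batch_first=True)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x:   (num_clips, in_dim)   clip features in temporal order
        # adj: (num_clips, num_clips) row-normalized adjacency with self-loops
        h = torch.relu(adj @ self.proj(x))               # one GCN hop: A X W
        out, _ = self.gru(h.unsqueeze(0))                # scan clips in order
        return out.squeeze(0)                            # (num_clips, hid_dim)

# Toy usage: 8 clips with 512-d features on an assumed chain-structured graph.
feats = torch.randn(8, 512)
adj = torch.eye(8) + torch.diag(torch.ones(7), 1) + torch.diag(torch.ones(7), -1)
adj = adj / adj.sum(dim=1, keepdim=True)                 # row-normalize
video_repr = RecurrentGCNSketch(512, 256)(feats, adj)
print(video_repr.shape)  # torch.Size([8, 256])
```

The chain adjacency here simply links each clip to its temporal neighbors; the dataset's actual question types may call for denser or learned graphs, which is exactly the relational structure the GCN hop is meant to exploit.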