ViLPAct: A Benchmark for Compositional Generalization on Multimodal Human Activities

Findings (Sydney (N.S.W.)) Pub Date: 2022-10-11 DOI: 10.48550/arXiv.2210.05556

Terry Yue Zhuo, Yaqing Liao, Yuecheng Lei, Lizhen Qu, Gerard de Melo, Xiaojun Chang, Yazhou Ren, Zenglin Xu


Abstract

We introduce ViLPAct, a novel vision-language benchmark for human activity planning. It is designed for a task in which embodied AI agents reason about and forecast the future actions of humans, given video clips of their initial activities and their intents expressed in text. The benchmark consists of 2.9k videos from Charades extended with intents via crowdsourcing, a multi-choice question test set, and four strong baselines. One of the baselines implements a neurosymbolic approach based on a multi-modal knowledge base (MKB), while the others are deep generative models adapted from recent state-of-the-art (SOTA) methods. According to our extensive experiments, the key challenges are compositional generalization and effective use of information from both modalities.


