An Action Is Worth Multiple Words: Handling Ambiguity in Action Recognition

BMVC: Proceedings of the British Machine Vision Conference. Pub Date: 2022-10-10. DOI: 10.48550/arXiv.2210.04933

Kiyoon Kim, D. Moltisanti, Oisin Mac Aodha, Laura Sevilla-Lara



Abstract

Precisely naming the action depicted in a video can be a challenging and oftentimes ambiguous task. In contrast to object instances represented as nouns (e.g. dog, cat, chair, etc.), in the case of actions, human annotators typically lack a consensus as to what constitutes a specific action (e.g. jogging versus running). In practice, a given video can contain multiple valid positive annotations for the same action. As a result, video datasets often contain significant levels of label noise and overlap between the atomic action classes. In this work, we address the challenge of training multi-label action recognition models from only single positive training labels. We propose two approaches that are based on generating pseudo training examples sampled from similar instances within the train set. Unlike other approaches that use model-derived pseudo-labels, our pseudo-labels come from human annotations and are selected based on feature similarity. To validate our approaches, we create a new evaluation benchmark by manually annotating a subset of EPIC-Kitchens-100's validation set with multiple verb labels. We present results on this new test set along with additional results on a new version of HMDB-51, called Confusing-HMDB-102, where we outperform existing methods in both cases. Data and code are available at https://github.com/kiyoon/verb_ambiguity
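The abstract says pseudo-labels are taken from the human annotations of similar training instances, selected by feature similarity, but it does not spell out either of the two proposed approaches. The following is only a minimal sketch of that general idea, not the authors' method: the function name `neighbour_pseudo_labels`, the use of cosine similarity, the neighbour count `k`, and the rule of copying neighbours' labels into a multi-hot target are all assumptions made for illustration.

```python
import numpy as np


def neighbour_pseudo_labels(features, labels, num_classes, k=2):
    """Turn single positive labels into multi-hot targets (illustrative only).

    features: (N, D) array, one feature vector per training video.
    labels:   (N,) integer array, one human-annotated class per video.
    k:        number of nearest neighbours to borrow labels from
              (hypothetical parameter; assumes k < N).
    """
    features = np.asarray(features, dtype=np.float64)
    labels = np.asarray(labels)
    n = len(labels)

    # L2-normalise so that a dot product equals cosine similarity.
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-12, None)
    sim = unit @ unit.T

    # A video must not count as its own neighbour.
    np.fill_diagonal(sim, -np.inf)

    # Indices of the k most similar training videos for each row.
    neighbours = np.argsort(-sim, axis=1)[:, :k]

    # Start from the original single positive label ...
    targets = np.zeros((n, num_classes), dtype=np.float32)
    targets[np.arange(n), labels] = 1.0

    # ... then mark each neighbour's human-annotated label as a pseudo-positive.
    rows = np.repeat(np.arange(n), neighbours.shape[1])
    targets[rows, labels[neighbours].ravel()] = 1.0
    return targets


if __name__ == "__main__":
    feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    single_labels = [0, 1, 2, 3]
    print(neighbour_pseudo_labels(feats, single_labels, num_classes=4, k=1))
```

In this toy input, videos 0 and 1 have nearly parallel features, so each inherits the other's verb label, and likewise for videos 2 and 3. Because the borrowed labels are existing human annotations rather than model predictions, the sketch matches the distinction the abstract draws with model-derived pseudo-labels.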

