End-To-End Part-Level Action Parsing With Transformer

2023 IEEE International Conference on Multimedia and Expo (ICME) Pub Date : 2023-07-01 DOI:10.1109/ICME55011.2023.00135

Xiaojia Chen, Xuanhan Wang, Beitao Chen, Lianli Gao



Abstract

The divide-and-conquer strategy, which interprets part-level action parsing as a detect-then-parsing pipeline, has been widely used and has become a general tool for part-level action understanding. However, existing methods derived from this strategy usually suffer from either a strong dependence on prior detection or high computational complexity. In this paper, we present the first fully end-to-end part-level action parsing framework with transformers, termed PATR. Unlike existing methods, our method treats part-level action parsing as a hierarchical set prediction problem and unifies person detection, body part detection, and action state recognition in one model. In PATR, predefined learnable representations, including general instance representations and general part representations, are guided to adaptively attend to the image features that are relevant to the target body parts. Then, conditioned on the corresponding learnable representations, the attended image features are hierarchically decoded into the corresponding semantics (i.e., person location, body part location, and action states for each body part). In this way, PATR relies on the characteristics of body parts, rather than on prior predictions such as bounding boxes, to parse action states, thus removing the strong dependence between sub-tasks and eliminating the computational burden of the multi-stage paradigm. Extensive experiments on the challenging Kinetic-TPS benchmark indicate that our method achieves very competitive results. In particular, our model outperforms all state-of-the-art part-level action parsing approaches, reaching around 3.8±2.0% higher Accp than previous methods. These findings indicate the potential of PATR to serve as a new baseline for future part-level action parsing methods. Our code and models are publicly available.
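The hierarchical decoding described in the abstract can be illustrated with a toy sketch. This is not the PATR implementation: the abstract gives no layer-level details, so everything below is an assumption made for illustration, including the single-head attention, the way a part query is conditioned on its parent instance (simple addition), the shared box head, the random "learned" weights, and all names and sizes.

```python
import math
import random
from dataclasses import dataclass, field

random.seed(0)
D = 8              # feature width (hypothetical)
NUM_INSTANCES = 3  # number of instance queries (hypothetical)
NUM_PARTS = 4      # number of part queries per instance (hypothetical)
NUM_STATES = 5     # number of action-state classes (hypothetical)

def rand_vec(n):
    return [random.uniform(-1, 1) for _ in range(n)]

def rand_mat(rows, cols):
    return [rand_vec(cols) for _ in range(rows)]

def linear(weight, x):
    """Bias-free linear layer: one output per row of weight."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(query, features):
    """Single-head scaled dot-product attention of one query over image features."""
    scale = math.sqrt(len(query))
    weights = softmax(
        [sum(q * f for q, f in zip(query, feat)) / scale for feat in features]
    )
    return [
        sum(w * feat[i] for w, feat in zip(weights, features))
        for i in range(len(query))
    ]

def sigmoid_box(raw):
    """Squash four raw outputs into normalised box coordinates in (0, 1)."""
    return [1 / (1 + math.exp(-r)) for r in raw]

@dataclass
class PartPrediction:
    box: list
    state: int

@dataclass
class PersonPrediction:
    box: list
    parts: list = field(default_factory=list)

# Stand-ins for learnable parameters; random here because nothing is trained.
instance_queries = rand_mat(NUM_INSTANCES, D)
part_queries = rand_mat(NUM_PARTS, D)
box_head = rand_mat(4, D)             # shared by persons and parts (simplification)
state_head = rand_mat(NUM_STATES, D)

def parse(features):
    """Decode image features into persons, each with parts and action states."""
    people = []
    for inst_q in instance_queries:
        inst_feat = cross_attend(inst_q, features)
        person = PersonPrediction(box=sigmoid_box(linear(box_head, inst_feat)))
        for part_q in part_queries:
            # Assumed conditioning: add the parent instance feature to the part query.
            cond_q = [a + b for a, b in zip(inst_feat, part_q)]
            part_feat = cross_attend(cond_q, features)
            logits = linear(state_head, part_feat)
            person.parts.append(PartPrediction(
                box=sigmoid_box(linear(box_head, part_feat)),
                state=max(range(NUM_STATES), key=logits.__getitem__),
            ))
        people.append(person)
    return people

image_features = rand_mat(16, D)  # 16 hypothetical image tokens
predictions = parse(image_features)
```

The point of the sketch is the output shape: one pass yields a set of persons, and each person carries its own set of part boxes and action states, with no separately detected bounding box fed into a second stage.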



