Adapting RGB Pose Estimation to New Domains

Gururaj Mulay, B. Draper, J. Beveridge
{"title":"Adapting RGB Pose Estimation to New Domains","authors":"Gururaj Mulay, B. Draper, J. Beveridge","doi":"10.1109/CCWC.2019.8666594","DOIUrl":null,"url":null,"abstract":"Many multi-modal human computer interaction (HCI) systems interact with users in real-time by estimating the user’s pose. Generally, they estimate human poses using depth sensors such as the Microsoft Kinect. For multi-modal HCI interfaces to gain traction in the real world, however, it would be better for pose estimation to be based on data from RGB cameras, which are more common and less expensive than depth sensors. This has motivated research into pose estimation from RGB images. Convolutional Neural Networks (CNNs) represent the state-of-the-art in this literature, for example [1], [2], [9], [13], [14], and [15]. These systems estimate 2D human poses from RGB images. A problem with current CNN-based pose estimators is that they require large amounts of labeled data for training. If the goal is to train an RGB pose estimator for a new domain, the cost of collecting and more importantly labeling data can be prohibitive. A common solution is to train on publicly available pose data sets, but then the trained system is not tailored to the domain. We propose using RGB+D sensors to collect domain-specific data in the lab, and then training the RGB pose estimator using skeletons automatically extracted from the RGB+D data. This paper presents a case study of adapting the RMPE pose estimation network [2] to the domain of the DARPA Communicating with Computers (CWC) program [3], as represented by the EGGNOG data set [8]. We chose RMPE because it predicts both joint locations and Part Affinity Fields (PAFs) in real-time. Our adaptation of RMPE trained on automatically-labeled data outperforms the original RMPE on the EGGNOG data set.","PeriodicalId":132812,"journal":{"name":"2019 IEEE 9th Annual Computing and Communication Workshop and Conference (CCWC)","volume":"426 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-06-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 IEEE 9th Annual Computing and Communication Workshop and Conference (CCWC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CCWC.2019.8666594","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1

Abstract

Many multi-modal human-computer interaction (HCI) systems interact with users in real time by estimating the user's pose. Generally, they estimate human poses using depth sensors such as the Microsoft Kinect. For multi-modal HCI interfaces to gain traction in the real world, however, it would be better for pose estimation to be based on data from RGB cameras, which are more common and less expensive than depth sensors. This has motivated research into pose estimation from RGB images. Convolutional Neural Networks (CNNs) represent the state-of-the-art in this literature, for example [1], [2], [9], [13], [14], and [15]. These systems estimate 2D human poses from RGB images. A problem with current CNN-based pose estimators is that they require large amounts of labeled data for training. If the goal is to train an RGB pose estimator for a new domain, the cost of collecting and, more importantly, labeling data can be prohibitive. A common solution is to train on publicly available pose data sets, but then the trained system is not tailored to the domain. We propose using RGB+D sensors to collect domain-specific data in the lab, and then training the RGB pose estimator using skeletons automatically extracted from the RGB+D data. This paper presents a case study of adapting the RMPE pose estimation network [2] to the domain of the DARPA Communicating with Computers (CWC) program [3], as represented by the EGGNOG data set [8]. We chose RMPE because it predicts both joint locations and Part Affinity Fields (PAFs) in real time. Our adaptation of RMPE trained on automatically-labeled data outperforms the original RMPE on the EGGNOG data set.
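To make the proposed pipeline concrete, the sketch below illustrates how automatically extracted RGB+D skeletons could be turned into the kind of training targets an RMPE-style network regresses: per-joint Gaussian confidence maps and per-limb Part Affinity Fields. This is not the authors' code; the joint subset, limb pairs, and all parameters are illustrative assumptions.

```python
# Minimal sketch (assumptions, not the paper's implementation): converting a
# Kinect-extracted 2D skeleton into heatmap/PAF training targets for an
# RMPE-style RGB pose estimator.
import numpy as np

# Hypothetical subset of Kinect joints and the limbs connecting them.
JOINTS = ["head", "neck", "r_shoulder", "r_elbow", "r_wrist"]
LIMBS = [(1, 0), (1, 2), (2, 3), (3, 4)]  # (parent, child) joint indices

def joint_heatmaps(joints_2d, h, w, sigma=7.0):
    """One Gaussian confidence map per joint, centered on its pixel location."""
    ys, xs = np.mgrid[0:h, 0:w]
    maps = np.zeros((len(joints_2d), h, w), dtype=np.float32)
    for k, (x, y) in enumerate(joints_2d):
        maps[k] = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
    return maps

def part_affinity_fields(joints_2d, h, w, limb_width=4.0):
    """Unit vectors along each limb inside a fixed-width band (one PAF per limb)."""
    ys, xs = np.mgrid[0:h, 0:w]
    pafs = np.zeros((len(LIMBS), 2, h, w), dtype=np.float32)
    for i, (a, b) in enumerate(LIMBS):
        pa = np.asarray(joints_2d[a], dtype=float)
        pb = np.asarray(joints_2d[b], dtype=float)
        v = pb - pa
        length = np.linalg.norm(v) + 1e-8
        v = v / length
        # Signed distance along the limb and absolute distance across it.
        dx, dy = xs - pa[0], ys - pa[1]
        along = v[0] * dx + v[1] * dy
        across = np.abs(v[1] * dx - v[0] * dy)
        mask = (along >= 0) & (along <= length) & (across <= limb_width)
        pafs[i, 0][mask] = v[0]
        pafs[i, 1][mask] = v[1]
    return pafs

# Example: one skeleton (pixel coordinates) from a single RGB+D frame becomes
# the supervision paired with the synchronized RGB image.
skeleton = [(64, 20), (64, 40), (50, 42), (45, 70), (42, 95)]
hm = joint_heatmaps(skeleton, 128, 128)
paf = part_affinity_fields(skeleton, 128, 128)
print(hm.shape, paf.shape)  # (5, 128, 128) (4, 2, 128, 128)
```

Under this scheme, each synchronized RGB frame inherits labels from the depth sensor's skeleton tracker, so domain-specific training data can be produced without manual annotation.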