Continuous activity understanding based on accumulative pose-context visual patterns

Yan Zhang, Georg Layher, H. Neumann
{"title":"Continuous activity understanding based on accumulative pose-context visual patterns","authors":"Yan Zhang, Georg Layher, H. Neumann","doi":"10.1109/IPTA.2017.8310114","DOIUrl":null,"url":null,"abstract":"In application domains, such as human-robot interaction and ambient intelligence, it is expected that an intelligent agent can respond to the person's actions efficiently or make predictions while the person's activity is still ongoing. In this paper, we investigate the problem of continuous activity understanding, based on a visual pattern extraction mechanism which fuses decomposed body pose features from estimated 2D skeletons (based on deep learning skeleton inference) and localized appearance-motion features around spatiotemporal interest points (STIPs). Considering that human activities are observed and inferred gradually, we partition the video into snippets, extract the visual pattern accumulatively and infer the activities in an online fashion. We evaluated the proposed method on two benchmark datasets and achieved 92.6% on the KTH dataset and 92.7% on the Rochester Assisted Daily Living dataset in the equilibrated inference states. In parallel, we discover that context information mainly contributed by STIPs is probably more favourable to activity recognition than the pose information, especially in scenarios of daily living activities. In addition, incorporating the visual patterns of activities from early stages to train the classifier can improve the performance of early recognition; however, it could degrade the recognition rate in later time. To overcome this issue, we propose a mixture model, where the classifier trained with early visual patterns are used in early stages while the classifier trained without early patterns are used in later stages. The experimental results show that this straightforward approach can improve early recognition while retaining the recognition correctness of later times.","PeriodicalId":316356,"journal":{"name":"2017 Seventh International Conference on Image Processing Theory, Tools and Applications (IPTA)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 Seventh International Conference on Image Processing Theory, Tools and Applications (IPTA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IPTA.2017.8310114","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1

Abstract

In application domains such as human-robot interaction and ambient intelligence, an intelligent agent is expected to respond to a person's actions efficiently or to make predictions while the person's activity is still ongoing. In this paper, we investigate the problem of continuous activity understanding, based on a visual pattern extraction mechanism that fuses decomposed body pose features from estimated 2D skeletons (obtained via deep-learning-based skeleton inference) with localized appearance-motion features around spatiotemporal interest points (STIPs). Since human activities are observed and inferred gradually, we partition the video into snippets, extract the visual patterns accumulatively, and infer the activities in an online fashion. We evaluated the proposed method on two benchmark datasets and achieved 92.6% accuracy on the KTH dataset and 92.7% on the Rochester Assisted Daily Living dataset in the equilibrated inference states. In parallel, we find that the context information contributed mainly by STIPs is probably more favourable for activity recognition than the pose information, especially in scenarios of daily living activities. In addition, incorporating the visual patterns of activities from early stages to train the classifier can improve early recognition performance; however, it can degrade the recognition rate at later times. To overcome this issue, we propose a mixture model in which the classifier trained with early visual patterns is used in early stages, while the classifier trained without early patterns is used in later stages. The experimental results show that this straightforward approach improves early recognition while retaining the recognition accuracy at later times.
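To make the accumulative snippet-based inference and the two-classifier mixture concrete, below is a minimal Python sketch. It is not the authors' implementation: the feature extractor, the two classifiers, the running-sum accumulation, and the `switch_snippet` hyperparameter are all assumptions introduced for illustration.

```python
# Minimal sketch (not the authors' code) of online activity inference with
# accumulated visual patterns and an early/late classifier mixture.
# `extract_snippet_features`, `clf_early`, `clf_late`, and `switch_snippet`
# are hypothetical placeholders.
import numpy as np

def online_activity_inference(snippets, clf_early, clf_late,
                              extract_snippet_features, switch_snippet=5):
    """Accumulate visual patterns snippet by snippet and predict online.

    clf_early: classifier trained with early-stage visual patterns included.
    clf_late:  classifier trained without early-stage patterns.
    switch_snippet: snippet index at which the early classifier hands over
                    to the late one (an assumed hyperparameter).
    """
    accumulated = None
    predictions = []
    for t, snippet in enumerate(snippets):
        # Fuse pose features (from estimated 2D skeletons) and STIP-based
        # appearance-motion features for the current snippet.
        feat = extract_snippet_features(snippet)
        # Accumulate the visual pattern over time; a running sum is used
        # here, though the paper's exact accumulation scheme may differ.
        accumulated = feat if accumulated is None else accumulated + feat
        # Mixture model: early-trained classifier in early stages,
        # late-trained classifier afterwards.
        clf = clf_early if t < switch_snippet else clf_late
        predictions.append(clf.predict(accumulated.reshape(1, -1))[0])
    return predictions
```

In this reading, the mixture is simply a time-indexed switch between two independently trained classifiers, which matches the paper's description of it as a straightforward approach.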