Contextual visual and motion salient fusion framework for action recognition in dark environments

Journal: Knowledge-Based Systems (Q1, Computer Science, Artificial Intelligence; Impact Factor 7.2)
DOI: 10.1016/j.knosys.2024.112480
Publication date: 2024-09-05
URL: https://www.sciencedirect.com/science/article/pii/S0950705124011146
Citations: 0
Abstract
Infrared (IR) human action recognition (AR) exhibits resilience against shifting illumination conditions, changes in appearance, and shadows. It has valuable applications in numerous areas of future sustainable and smart cities, including robotics, intelligent systems, security, and transportation. However, current IR-based recognition approaches predominantly concentrate on spatial or local temporal information and often overlook the potential value of global temporal patterns. This oversight can lead to incomplete representations of body-part movements and prevent accurate optimization of a network. Therefore, a contextual-motion coalescence network (CMCNet) is proposed that operates in a streamlined, end-to-end manner for robust action representation in darkness in a near-infrared (NIR) setting. Initially, the data are preprocessed: the foreground is separated, normalized, and resized. The framework employs two parallel modules: the contextual visual features learning module (CVFLM) for local feature extraction, and the temporal optical flow learning module (TOFLM) for acquiring motion dynamics. These modules focus on action-relevant regions using shifted-window-based operations to ensure accurate interpretation of motion information. The coalescence block then integrates the contextual and motion features within a unified framework. Finally, the temporal decoder module discriminatively identifies the boundaries of the action sequence. This sequence of steps ensures the synergistic optimization of both CVFLM and TOFLM and thorough, competent feature extraction for precise AR. Evaluations of CMCNet are carried out on the publicly available datasets InfAR and NTU RGB+D, where superior performance is achieved. Our model yields the highest average precision of 89% and 85% on these datasets, respectively, representing an improvement of 2.25% (on InfAR) over conventional methods operating at the spatial and optical-flow levels, which underscores its efficacy.
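The two-branch pipeline described above (a contextual appearance stream and a motion stream, fused by a coalescence step) can be illustrated with a minimal NumPy sketch. This is not the authors' CVFLM/TOFLM implementation: the function names are hypothetical, the "contextual" branch is a simple per-frame pooling stand-in, and frame differencing is used only as a crude proxy for optical flow.

```python
import numpy as np

rng = np.random.default_rng(0)

def contextual_features(frames):
    # Stand-in for CVFLM: one mean-pooled appearance value per frame.
    return frames.reshape(frames.shape[0], -1).mean(axis=1, keepdims=True)

def motion_features(frames):
    # Stand-in for TOFLM: frame-difference magnitude as a crude
    # optical-flow proxy; prepend keeps the temporal length unchanged.
    diffs = np.abs(np.diff(frames, axis=0, prepend=frames[:1]))
    return diffs.reshape(frames.shape[0], -1).mean(axis=1, keepdims=True)

def coalesce(ctx, mot):
    # Fusion step: concatenate the two per-frame feature streams,
    # so each frame carries both appearance and motion evidence.
    return np.concatenate([ctx, mot], axis=1)

T, H, W = 16, 8, 8                       # a tiny 16-frame grayscale clip
clip = rng.random((T, H, W))
fused = coalesce(contextual_features(clip), motion_features(clip))
print(fused.shape)  # (16, 2): one contextual + one motion feature per frame
```

In the paper the fused per-frame representation would then feed the temporal decoder that localizes action boundaries; here the sketch only shows how parallel branches over the same clip can be combined along the feature axis.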
Journal introduction:
Knowledge-Based Systems, an international and interdisciplinary journal in artificial intelligence, publishes original, innovative, and creative research results in the field. It focuses on systems based on knowledge-based and other artificial intelligence techniques. The journal aims to support human prediction and decision-making through data science and computation techniques, to provide balanced coverage of theory and practical study, and to encourage the development and implementation of knowledge-based intelligence models, methods, systems, and software tools. Applications in business, government, education, engineering, and healthcare are emphasized.