MSSTNet: Multi-scale facial videos pulse extraction network based on separable spatiotemporal convolution and dimension separable attention

Q1 Computer Science

Virtual Reality Intelligent Hardware. Pub Date: 2023-04-01. DOI: 10.1016/j.vrih.2022.07.001

Changchen Zhao, Hongsheng Wang, Yuanjing Feng

Volume 5, Issue 2, Pages 124-141. Source: https://www.sciencedirect.com/science/article/pii/S2096579622000626

Citations: 2

Abstract

Background

Estimating blood volume pulse without contact via remote photoplethysmography (rPPG) has been an active research topic in recent years. Existing methods mainly rely on a single-scale region of interest (ROI). However, some noise signals that are hard to separate in single-scale space can be separated easily in multi-scale space. In addition, existing spatiotemporal networks focus mainly on local spatiotemporal information and place too little emphasis on temporal information, which is crucial for pulse extraction, resulting in insufficient spatiotemporal feature modeling.

Methods

This paper proposes a multi-scale facial video pulse extraction network based on separable spatiotemporal convolution and dimension separable attention. First, to address the limitation of a single-scale ROI, we construct a multi-scale feature space for initial signal separation. Second, separable spatiotemporal convolution and dimension separable attention are designed for efficient spatiotemporal correlation modeling; they increase the information interaction between long-span temporal and spatial dimensions and put more emphasis on temporal features.
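The abstract does not spell out how the separable spatiotemporal convolution is built. As a rough illustration only, the sketch below assumes one common factorization, in which a full k_t x k_s x k_s 3D kernel is replaced by a spatial 1 x k_s x k_s convolution followed by a temporal k_t x 1 x 1 convolution, and compares the resulting weight counts. The function names, the intermediate channel width `c_mid`, and the example layer sizes are hypothetical and are not taken from the paper.

```python
def conv3d_params(c_in, c_out, k_t, k_s):
    """Weights in a full 3D convolution with a k_t x k_s x k_s kernel (no bias)."""
    return c_in * c_out * k_t * k_s * k_s


def separable_params(c_in, c_out, k_t, k_s, c_mid=None):
    """Weights when the kernel is split into a spatial and a temporal stage.

    Assumed factorization (not confirmed by the paper):
      spatial:  c_in  -> c_mid with a 1 x k_s x k_s kernel
      temporal: c_mid -> c_out with a k_t x 1 x 1 kernel
    """
    if c_mid is None:
        c_mid = c_out  # hypothetical default for the intermediate width
    spatial = c_in * c_mid * k_s * k_s
    temporal = c_mid * c_out * k_t
    return spatial + temporal


if __name__ == "__main__":
    # Hypothetical layer: 32 -> 32 channels with a 3 x 3 spatial kernel,
    # evaluated for increasingly long temporal kernels.
    for k_t in (3, 7, 15):
        full = conv3d_params(32, 32, k_t, 3)
        sep = separable_params(32, 32, k_t, 3)
        print(f"k_t={k_t:2d}  full={full:7d}  separable={sep:7d}")
```

Under this assumption, lengthening the temporal kernel adds weights linearly instead of multiplying the spatial cost, which is one plausible way a network could afford longer temporal receptive fields.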

Results

The proposed network reaches a signal-to-noise ratio (SNR) of 9.58 dB on the PURE dataset and 6.77 dB on the UBFC-rPPG dataset, outperforming state-of-the-art algorithms.
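The abstract does not define how the SNR is computed. A definition widely used in rPPG evaluation takes the ratio, in dB, of spectral power near the reference heart-rate frequency and its first harmonic to the remaining power in a plausible pulse band. The sketch below assumes that definition; the function names, the 0.1 Hz half-width, and the 0.7 to 4.0 Hz band are illustrative assumptions, and the paper's exact settings may differ.

```python
import cmath
import math


def power_spectrum(signal, fs):
    """Return (freqs_hz, power) for the one-sided DFT of a real signal."""
    n = len(signal)
    mean = sum(signal) / n
    x = [v - mean for v in signal]  # remove the DC offset
    freqs, power = [], []
    for k in range(n // 2 + 1):
        coeff = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        freqs.append(k * fs / n)
        power.append(abs(coeff) ** 2)
    return freqs, power


def rppg_snr_db(signal, fs, hr_hz, half_width=0.1, band=(0.7, 4.0)):
    """SNR in dB: power near hr_hz and 2 * hr_hz versus the rest of the band.

    half_width and band are assumed values, not taken from the paper.
    """
    freqs, power = power_spectrum(signal, fs)
    sig = noise = 0.0
    for f, p in zip(freqs, power):
        if not band[0] <= f <= band[1]:
            continue  # ignore frequencies outside the assumed pulse band
        near_f0 = abs(f - hr_hz) <= half_width
        near_f1 = abs(f - 2 * hr_hz) <= half_width
        if near_f0 or near_f1:
            sig += p
        else:
            noise += p
    return 10 * math.log10(sig / noise)


if __name__ == "__main__":
    # Synthetic 10 s trace at 30 fps: a 1.2 Hz pulse (72 bpm) plus a 3.0 Hz
    # interferer at one tenth of the amplitude, i.e. 100 times less power.
    fs, n = 30.0, 300
    trace = [math.sin(2 * math.pi * 1.2 * t / fs)
             + 0.1 * math.sin(2 * math.pi * 3.0 * t / fs) for t in range(n)]
    print(f"SNR = {rppg_snr_db(trace, fs, hr_hz=1.2):.2f} dB")
```

The naive DFT keeps the example dependency-free; for real recordings an FFT routine and a windowed spectrum would be the more practical choice.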

Conclusions

The results show that fusing multi-scale signals generally yields better results than methods based on a single-scale signal alone. The proposed separable spatiotemporal convolution and dimension separable attention mechanism contribute to more accurate pulse signal extraction.


