MSSTNet: Multi-scale facial videos pulse extraction network based on separable spatiotemporal convolution and dimension separable attention
Authors: Changchen Zhao, Hongsheng Wang, Yuanjing Feng
Journal: Virtual Reality Intelligent Hardware, 5(2), pages 124-141
Publication date: 2023-04-01
DOI: 10.1016/j.vrih.2022.07.001
URL: https://www.sciencedirect.com/science/article/pii/S2096579622000626
Citations: 2
Abstract
Background
Non-contact estimation of the blood volume pulse with remote photoplethysmography (rPPG) has been an active research topic in recent years. Existing methods mainly rely on a single-scale region of interest (ROI). However, some noise signals that are hard to separate in single-scale space can be separated easily in multi-scale space. In addition, existing spatiotemporal networks focus mainly on local spatiotemporal information and place too little emphasis on temporal information, which is crucial for pulse extraction; the result is insufficient spatiotemporal feature modeling.
Methods
This paper proposes a multi-scale facial video pulse extraction network based on separable spatiotemporal convolution and dimension separable attention. First, to address the limitation of a single-scale ROI, we construct a multi-scale feature space for initial signal separation. Second, separable spatiotemporal convolution and dimension separable attention are designed for efficient spatiotemporal correlation modeling; they increase the information interaction between the long-span time dimension and the space dimensions and put more emphasis on temporal features.
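The abstract does not spell out how the separable spatiotemporal convolution is built. A minimal sketch of one common reading, assumed here and not confirmed by the source, is to factor a full 3D kernel into a 2D spatial convolution followed by a 1D temporal convolution. The function names and the channel widths below are hypothetical, and biases are ignored. The sketch only counts weights, to show why such a factorization makes long temporal kernels cheap.

```python
def full_3d_params(c_in, c_out, kt, kh, kw):
    """Weights in one dense 3D convolution with a kt x kh x kw kernel."""
    return c_in * c_out * kt * kh * kw


def separable_params(c_in, c_mid, c_out, kt, kh, kw):
    """Weights in an assumed factorization: spatial 2D conv, then temporal 1D conv.

    c_mid is the (hypothetical) channel width between the two stages.
    """
    spatial = c_in * c_mid * kh * kw   # 1 x kh x kw kernel
    temporal = c_mid * c_out * kt      # kt x 1 x 1 kernel
    return spatial + temporal


if __name__ == "__main__":
    # Compare a short and a long temporal span at the same spatial kernel size.
    for kt in (3, 9):
        dense = full_3d_params(64, 64, kt, 3, 3)
        factored = separable_params(64, 64, 64, kt, 3, 3)
        print(f"kt={kt}: dense={dense}, factored={factored}")
```

Under this assumption the temporal kernel length enters the cost additively rather than multiplicatively, which is one plausible way to cover a long time span while keeping the layer small.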
Results
The signal-to-noise ratio (SNR) of the proposed network reaches 9.58 dB on the PURE dataset and 6.77 dB on the UBFC-rPPG dataset, which outperforms state-of-the-art algorithms.
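The abstract does not define how the SNR is computed. A minimal sketch, assuming the definition widely used in rPPG evaluation: spectral power near the reference heart-rate frequency and its first harmonic, divided by the remaining power inside a pulse band, expressed in dB. The function names, the 0.1 Hz window half-width, and the 0.5 to 4 Hz band are illustrative assumptions, not values taken from the paper.

```python
import cmath
import math


def power_spectrum(signal, fs):
    """One-sided power spectrum of a mean-removed signal via a plain DFT."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [x - mean for x in signal]
    freqs, power = [], []
    for k in range(n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * i / n)
            for i, x in enumerate(centered)
        )
        freqs.append(k * fs / n)
        power.append(abs(coeff) ** 2)
    return freqs, power


def rppg_snr_db(signal, fs, hr_hz, half_width=0.1, band=(0.5, 4.0)):
    """SNR in dB: power near hr_hz and 2*hr_hz versus the rest of the band."""
    freqs, power = power_spectrum(signal, fs)
    sig = noise = 0.0
    for f, p in zip(freqs, power):
        if not band[0] <= f <= band[1]:
            continue  # ignore everything outside the assumed pulse band
        near_fundamental = abs(f - hr_hz) <= half_width
        near_harmonic = abs(f - 2 * hr_hz) <= half_width
        if near_fundamental or near_harmonic:
            sig += p
        else:
            noise += p
    if noise == 0.0:
        return math.inf
    if sig == 0.0:
        return -math.inf
    return 10 * math.log10(sig / noise)


def synthetic_pulse(fs=30, seconds=10, hr_hz=1.2, noise_hz=2.0, noise_amp=0.3):
    """Toy signal: a pulse sinusoid plus one weaker interfering sinusoid."""
    n = fs * seconds
    return [
        math.sin(2 * math.pi * hr_hz * i / fs)
        + noise_amp * math.sin(2 * math.pi * noise_hz * i / fs)
        for i in range(n)
    ]


if __name__ == "__main__":
    pulse = synthetic_pulse()
    print(round(rppg_snr_db(pulse, fs=30, hr_hz=1.2), 2))
```

In practice the reference frequency would come from a contact sensor recorded alongside the video, and the estimated pulse signal from the network output; the synthetic signal here only exercises the formula.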
Conclusions
Results show that fusing multi-scale signals generally yields better results than methods based on a single-scale signal alone. The proposed separable spatiotemporal convolution and dimension separable attention mechanism contribute to more accurate pulse signal extraction.