A comparative study on wearables and single-camera video for upper-limb out-of-the-lab activity recognition with different deep learning architectures

Mario Martínez Zarzuela, David González-Ortega, Míriam Antón-Rodríguez, Francisco Javier Díaz-Pernas, Henning Müller, Cristina Simón-Martínez
{"title":"基于不同深度学习架构的可穿戴设备和单摄像头视频上肢实验室外活动识别的比较研究","authors":"Mario Martínez Zarzuela, David González-Ortega, Míriam Antón-Rodríguez, Francisco Javier Díaz-Pernas, Henning Müller, Cristina Simón-Martínez","doi":"10.1016/j.gaitpost.2023.07.149","DOIUrl":null,"url":null,"abstract":"The use of a wide range of computer vision solutions, and more recently high-end Inertial Measurement Units (IMU) have become increasingly popular for assessing human physical activity in clinical and research settings [1]. Nevertheless, to increase the feasibility of patient tracking in out-of-the-lab settings, it is necessary to use a reduced number of devices for movement acquisition. Promising solutions in this context are IMU-based wearables and single camera systems [2]. Additionally, the development of machine learning systems able to recognize and digest clinically relevant data in-the-wild is needed, and therefore determining the ideal input to those is crucial [3]. For upper-limb activity recognition out-of-the-lab, do wearables or single camera offer better performance? Recordings from 16 healthy subjects performing 8 upper-limb activities from the VIDIMU dataset [4] were used. For wearable recordings, the subjects wore 5 IMU-based wearables and adopted a neutral pose (N-pose) for calibration. Joint angles were estimated with inverse kinematics algorithms in OpenSense [5]. Single-camera video recordings occurred simultaneously. Joint angles were estimated with inverse kinematics algorithms in OpenSense. Single-camera video recordings occurred simultaneously, and the subject’s pose was estimated with DeepStream [6]. We compared various Deep Learning architectures (DNN, CNN, CNN-LSTM, LSTM-CNN, LSTM, LSTM-AE) for recognizing daily living activities. The input to the different neural architectures consisted in a 2-second time series containing the estimated joint angles and their 2D FFT. Every network was trained using 2 subjects for validation, a batch size of 20, Adam as the optimizer, and combining early stopping and other regularization techniques. Performance metrics were extracted from 4-fold cross-validation experiments. In all neural networks, performance was higher with IMU-based wearables data compared to video. The best network was an LSTM AutoEncoder (6 layers, 700 K parameters; wearable data accuracy:0.985, F1-score:0.936 (Fig. 1); video data accuracy:0.962, F1-score:0.842). Remarkably, when using video as input there were no significant differences in the performance metrics obtained among different architectures. On the contrary, the F1 scores using IMU data varied significantly (DNN: 0.849, CNN: 0.889, CNN-LSTM: 0.879, LSTM-CNN: 0.904, LSTM: 0.920, LSTM-AE: 0.936).Download : Download high-res image (108KB)Download : Download full-size image Wearables and video present advantages and disadvantages. While IMUs can provide accurate information about the orientation and acceleration of body parts, body-to-segment calibration and drift can affect data reliability. Similarly, a single camera can easily track the position of different body joints, but the recorded data does not yet reliably represent the movement with all degrees of freedom. Our experiments confirm that despite the current limitations of wearables, with a very simple N-pose calibration, IMU data provides more discriminative features for upper-limb activity recognition. Our results are consistent with previous studies that have shown the advantages of IMUs for movement recognition [7]. 
In the future, we will estimate how these data compare to gold-standard systems.","PeriodicalId":94018,"journal":{"name":"Gait & posture","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2023-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"A comparative study on wearables and single-camera video for upper-limb out-of-the-lab activity recognition with different deep learning architectures\",\"authors\":\"Mario Martínez Zarzuela, David González-Ortega, Míriam Antón-Rodríguez, Francisco Javier Díaz-Pernas, Henning Müller, Cristina Simón-Martínez\",\"doi\":\"10.1016/j.gaitpost.2023.07.149\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The use of a wide range of computer vision solutions, and more recently high-end Inertial Measurement Units (IMU) have become increasingly popular for assessing human physical activity in clinical and research settings [1]. Nevertheless, to increase the feasibility of patient tracking in out-of-the-lab settings, it is necessary to use a reduced number of devices for movement acquisition. Promising solutions in this context are IMU-based wearables and single camera systems [2]. Additionally, the development of machine learning systems able to recognize and digest clinically relevant data in-the-wild is needed, and therefore determining the ideal input to those is crucial [3]. For upper-limb activity recognition out-of-the-lab, do wearables or single camera offer better performance? Recordings from 16 healthy subjects performing 8 upper-limb activities from the VIDIMU dataset [4] were used. For wearable recordings, the subjects wore 5 IMU-based wearables and adopted a neutral pose (N-pose) for calibration. Joint angles were estimated with inverse kinematics algorithms in OpenSense [5]. Single-camera video recordings occurred simultaneously. Joint angles were estimated with inverse kinematics algorithms in OpenSense. Single-camera video recordings occurred simultaneously, and the subject’s pose was estimated with DeepStream [6]. We compared various Deep Learning architectures (DNN, CNN, CNN-LSTM, LSTM-CNN, LSTM, LSTM-AE) for recognizing daily living activities. The input to the different neural architectures consisted in a 2-second time series containing the estimated joint angles and their 2D FFT. Every network was trained using 2 subjects for validation, a batch size of 20, Adam as the optimizer, and combining early stopping and other regularization techniques. Performance metrics were extracted from 4-fold cross-validation experiments. In all neural networks, performance was higher with IMU-based wearables data compared to video. The best network was an LSTM AutoEncoder (6 layers, 700 K parameters; wearable data accuracy:0.985, F1-score:0.936 (Fig. 1); video data accuracy:0.962, F1-score:0.842). Remarkably, when using video as input there were no significant differences in the performance metrics obtained among different architectures. On the contrary, the F1 scores using IMU data varied significantly (DNN: 0.849, CNN: 0.889, CNN-LSTM: 0.879, LSTM-CNN: 0.904, LSTM: 0.920, LSTM-AE: 0.936).Download : Download high-res image (108KB)Download : Download full-size image Wearables and video present advantages and disadvantages. While IMUs can provide accurate information about the orientation and acceleration of body parts, body-to-segment calibration and drift can affect data reliability. 
Similarly, a single camera can easily track the position of different body joints, but the recorded data does not yet reliably represent the movement with all degrees of freedom. Our experiments confirm that despite the current limitations of wearables, with a very simple N-pose calibration, IMU data provides more discriminative features for upper-limb activity recognition. Our results are consistent with previous studies that have shown the advantages of IMUs for movement recognition [7]. In the future, we will estimate how these data compare to gold-standard systems.\",\"PeriodicalId\":94018,\"journal\":{\"name\":\"Gait & posture\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-09-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Gait & posture\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1016/j.gaitpost.2023.07.149\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Gait & posture","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1016/j.gaitpost.2023.07.149","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

A wide range of computer vision solutions and, more recently, high-end Inertial Measurement Units (IMUs) have become increasingly popular for assessing human physical activity in clinical and research settings [1]. Nevertheless, to make patient tracking feasible in out-of-the-lab settings, it is necessary to reduce the number of devices used for movement acquisition. Promising solutions in this context are IMU-based wearables and single-camera systems [2]. Additionally, machine learning systems able to recognize and digest clinically relevant data in the wild are needed, and determining the ideal input to such systems is therefore crucial [3]. For out-of-the-lab upper-limb activity recognition, do wearables or a single camera offer better performance?

Recordings from 16 healthy subjects performing 8 upper-limb activities from the VIDIMU dataset [4] were used. For the wearable recordings, the subjects wore 5 IMU-based wearables and adopted a neutral pose (N-pose) for calibration; joint angles were estimated with inverse kinematics algorithms in OpenSense [5]. Single-camera video recordings occurred simultaneously, and the subject's pose was estimated with DeepStream [6]. We compared several deep learning architectures (DNN, CNN, CNN-LSTM, LSTM-CNN, LSTM, LSTM-AE) for recognizing these daily living activities. The input to each architecture consisted of a 2-second time series containing the estimated joint angles and their 2D FFT. Every network was trained with 2 subjects held out for validation, a batch size of 20, Adam as the optimizer, and a combination of early stopping and other regularization techniques. Performance metrics were extracted from 4-fold cross-validation experiments (see the illustrative sketches below).

For all neural networks, performance was higher with IMU-based wearable data than with video. The best network was an LSTM AutoEncoder (6 layers, 700K parameters; wearable data: accuracy 0.985, F1-score 0.936 (Fig. 1); video data: accuracy 0.962, F1-score 0.842). Remarkably, when using video as input there were no significant differences in the performance metrics obtained across architectures. On the contrary, the F1-scores obtained with IMU data varied significantly (DNN: 0.849, CNN: 0.889, CNN-LSTM: 0.879, LSTM-CNN: 0.904, LSTM: 0.920, LSTM-AE: 0.936).

Wearables and video each present advantages and disadvantages. While IMUs can provide accurate information about the orientation and acceleration of body parts, body-to-segment calibration and drift can affect data reliability. Similarly, a single camera can easily track the positions of different body joints, but the recorded data do not yet reliably represent the movement in all degrees of freedom. Our experiments confirm that, despite the current limitations of wearables and with only a very simple N-pose calibration, IMU data provide more discriminative features for upper-limb activity recognition. Our results are consistent with previous studies that have shown the advantages of IMUs for movement recognition [7]. In the future, we will estimate how these data compare to gold-standard systems.
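The abstract describes two steps that are easy to misread: how each 2-second window of joint angles is turned into a network input (raw angles plus their 2D FFT), and how the networks were trained (Adam, batch size 20, early stopping, 4-fold cross-validation). The sketch below illustrates one plausible reading of the input construction in Python; the sampling rate, the non-overlapping window hop, and the feature layout (spectrum concatenated as extra channels) are assumptions, not details taken from the paper.

```python
import numpy as np

def build_windows(joint_angles, fs=50, win_seconds=2.0):
    """Cut a joint-angle time series into non-overlapping 2-second windows
    and append the magnitude of each window's 2D FFT as extra channels.

    joint_angles: (n_samples, n_joints) array of estimated joint angles.
    fs: assumed sampling rate in Hz (not reported in the abstract).
    Returns an array of shape (n_windows, win_len, 2 * n_joints).
    """
    win_len = int(round(win_seconds * fs))
    n_windows = joint_angles.shape[0] // win_len
    windows = []
    for i in range(n_windows):
        segment = joint_angles[i * win_len:(i + 1) * win_len]   # (win_len, n_joints)
        spectrum = np.abs(np.fft.fft2(segment))                  # 2D FFT magnitude, same shape
        windows.append(np.concatenate([segment, spectrum], axis=1))
    return np.stack(windows)
```

The second sketch shows a minimal LSTM autoencoder with a classification head, trained with the reported settings (Adam, batch size 20, early stopping). The layer widths, the latent size, the joint reconstruction/classification loss, and the patience value are illustrative; the original 6-layer, ~700K-parameter configuration is not detailed in the text.

```python
from tensorflow.keras import layers, models, callbacks

def build_lstm_ae_classifier(win_len, n_features, n_classes):
    """Hypothetical LSTM autoencoder whose latent code also feeds a softmax classifier."""
    inputs = layers.Input(shape=(win_len, n_features))
    # Encoder: stacked LSTMs compress the 2-second window into a latent vector.
    x = layers.LSTM(128, return_sequences=True)(inputs)
    code = layers.LSTM(64)(x)
    # Decoder: reconstruct the input window from the latent vector.
    d = layers.RepeatVector(win_len)(code)
    d = layers.LSTM(128, return_sequences=True)(d)
    recon = layers.TimeDistributed(layers.Dense(n_features), name="recon")(d)
    # Classification head on the latent code (8 daily-living activities).
    label = layers.Dense(n_classes, activation="softmax", name="label")(code)
    model = models.Model(inputs, [recon, label])
    model.compile(optimizer="adam",
                  loss={"recon": "mse", "label": "sparse_categorical_crossentropy"},
                  metrics={"label": "accuracy"})
    return model

# Training setup reported in the abstract: Adam, batch size 20, early stopping.
# x_train / x_val come from build_windows(); y_* are integer activity labels (0..7).
# model = build_lstm_ae_classifier(win_len=100, n_features=2 * 13, n_classes=8)
# early_stop = callbacks.EarlyStopping(monitor="val_loss", patience=10,
#                                      restore_best_weights=True)
# model.fit(x_train, {"recon": x_train, "label": y_train},
#           validation_data=(x_val, {"recon": x_val, "label": y_val}),
#           batch_size=20, epochs=200, callbacks=[early_stop])
```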