{"title":"使用深度学习的跌倒检测,从视频帧的递归二次分割中计算特征","authors":"Zahra Solatidehkordi, Tamer Shanableh","doi":"10.1016/j.imavis.2025.105749","DOIUrl":null,"url":null,"abstract":"<div><div>Accidental falls are a leading cause of injury and death worldwide, particularly among the elderly. Despite extensive research on fall detection, many existing systems remain limited by reliance on wearable sensors that are inconvenient for continuous use, or vision-based approaches that require full video decoding, human pose estimation, or simplified datasets that fail to capture the complexity of real-life environments. As a result, their accuracy often deteriorates in realistic scenarios such as nursing homes or crowded public spaces. In this paper, we introduce a novel fall detection framework that leverages information embedded in the High Efficiency Video Coding (HEVC) standard. Unlike traditional vision-based methods, our approach extracts spatio-temporal features directly from recursive block splits and other HEVC coding information. This includes creating a sequence of four RGB input images which capture block sizes and splits of the video frames in a visual manner. The block sizes in video coding are determined based on the spatio-temporal activities in the frames, hence the suitability of using them as features. Other features are also derived from the coded videos, including compression modes, motion vectors, and prediction residuals. To enhance robustness, we integrate these features into deep learning models and employ fusion strategies that combine complementary representations. Extensive evaluations on two challenging datasets: the Real-World Fall Dataset (RFDS) and the High-Quality Fall Simulation Dataset (HQFSD), demonstrate that our method achieves superior accuracy and robustness compared to prior work. In addition, our method requires only around 23 GFLOPs per video because the deep learning network is executed on just four fixed-frame representations, whereas traditional pipelines process every frame individually, often amounting to hundreds of frames per video and orders of magnitude higher FLOPs.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"163 ","pages":"Article 105749"},"PeriodicalIF":4.2000,"publicationDate":"2025-09-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Fall detection using deep learning with features computed from recursive quadratic splits of video frames\",\"authors\":\"Zahra Solatidehkordi, Tamer Shanableh\",\"doi\":\"10.1016/j.imavis.2025.105749\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Accidental falls are a leading cause of injury and death worldwide, particularly among the elderly. Despite extensive research on fall detection, many existing systems remain limited by reliance on wearable sensors that are inconvenient for continuous use, or vision-based approaches that require full video decoding, human pose estimation, or simplified datasets that fail to capture the complexity of real-life environments. As a result, their accuracy often deteriorates in realistic scenarios such as nursing homes or crowded public spaces. In this paper, we introduce a novel fall detection framework that leverages information embedded in the High Efficiency Video Coding (HEVC) standard. Unlike traditional vision-based methods, our approach extracts spatio-temporal features directly from recursive block splits and other HEVC coding information. This includes creating a sequence of four RGB input images which capture block sizes and splits of the video frames in a visual manner. The block sizes in video coding are determined based on the spatio-temporal activities in the frames, hence the suitability of using them as features. Other features are also derived from the coded videos, including compression modes, motion vectors, and prediction residuals. To enhance robustness, we integrate these features into deep learning models and employ fusion strategies that combine complementary representations. Extensive evaluations on two challenging datasets: the Real-World Fall Dataset (RFDS) and the High-Quality Fall Simulation Dataset (HQFSD), demonstrate that our method achieves superior accuracy and robustness compared to prior work. In addition, our method requires only around 23 GFLOPs per video because the deep learning network is executed on just four fixed-frame representations, whereas traditional pipelines process every frame individually, often amounting to hundreds of frames per video and orders of magnitude higher FLOPs.</div></div>\",\"PeriodicalId\":50374,\"journal\":{\"name\":\"Image and Vision Computing\",\"volume\":\"163 \",\"pages\":\"Article 105749\"},\"PeriodicalIF\":4.2000,\"publicationDate\":\"2025-09-25\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Image and Vision Computing\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0262885625003373\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Image and Vision Computing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0262885625003373","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Fall detection using deep learning with features computed from recursive quadratic splits of video frames
Accidental falls are a leading cause of injury and death worldwide, particularly among the elderly. Despite extensive research on fall detection, many existing systems remain limited by reliance on wearable sensors that are inconvenient for continuous use, or vision-based approaches that require full video decoding, human pose estimation, or simplified datasets that fail to capture the complexity of real-life environments. As a result, their accuracy often deteriorates in realistic scenarios such as nursing homes or crowded public spaces. In this paper, we introduce a novel fall detection framework that leverages information embedded in the High Efficiency Video Coding (HEVC) standard. Unlike traditional vision-based methods, our approach extracts spatio-temporal features directly from recursive block splits and other HEVC coding information. This includes creating a sequence of four RGB input images which capture block sizes and splits of the video frames in a visual manner. The block sizes in video coding are determined based on the spatio-temporal activities in the frames, hence the suitability of using them as features. Other features are also derived from the coded videos, including compression modes, motion vectors, and prediction residuals. To enhance robustness, we integrate these features into deep learning models and employ fusion strategies that combine complementary representations. Extensive evaluations on two challenging datasets: the Real-World Fall Dataset (RFDS) and the High-Quality Fall Simulation Dataset (HQFSD), demonstrate that our method achieves superior accuracy and robustness compared to prior work. In addition, our method requires only around 23 GFLOPs per video because the deep learning network is executed on just four fixed-frame representations, whereas traditional pipelines process every frame individually, often amounting to hundreds of frames per video and orders of magnitude higher FLOPs.
期刊介绍:
Image and Vision Computing has as a primary aim the provision of an effective medium of interchange for the results of high quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real world scenes. It seeks to strengthen a deeper understanding in the discipline by encouraging the quantitative comparison and performance evaluation of the proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, image databases.