{"title":"基于 LSTM 和变压器的人体运动预测神经网络算法研究","authors":"S. V. Zhiganov, Y. S. Ivanov, D. M. Grabar","doi":"10.1134/S1064562423701624","DOIUrl":null,"url":null,"abstract":"<p>The problem of predicting the position of a person on future frames of a video stream is solved, and in-depth experimental studies on the application of traditional and SOTA blocks for this task are carried out. An original architecture of KeyFNet and its modifications based on transform blocks is presented, which is able to predict coordinates in the video stream for 30, 60, 90, and 120 frames ahead with high accuracy. The novelty lies in the application of a combined algorithm based on multiple FNet blocks with fast Fourier transform as an attention mechanism concatenating the coordinates of key points. Experiments on Human3.6M and on our own real data confirmed the effectiveness of the proposed approach based on FNet blocks, compared to the traditional approach based on LSTM. The proposed algorithm matches the accuracy of advanced models, but outperforms them in terms of speed, uses less computational resources, and thus can be applied in collaborative robotic solutions.</p>","PeriodicalId":0,"journal":{"name":"","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-03-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Investigation of Neural Network Algorithms for Human Movement Prediction Based on LSTM and Transformers\",\"authors\":\"S. V. Zhiganov, Y. S. Ivanov, D. M. Grabar\",\"doi\":\"10.1134/S1064562423701624\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p>The problem of predicting the position of a person on future frames of a video stream is solved, and in-depth experimental studies on the application of traditional and SOTA blocks for this task are carried out. An original architecture of KeyFNet and its modifications based on transform blocks is presented, which is able to predict coordinates in the video stream for 30, 60, 90, and 120 frames ahead with high accuracy. The novelty lies in the application of a combined algorithm based on multiple FNet blocks with fast Fourier transform as an attention mechanism concatenating the coordinates of key points. Experiments on Human3.6M and on our own real data confirmed the effectiveness of the proposed approach based on FNet blocks, compared to the traditional approach based on LSTM. The proposed algorithm matches the accuracy of advanced models, but outperforms them in terms of speed, uses less computational resources, and thus can be applied in collaborative robotic solutions.</p>\",\"PeriodicalId\":0,\"journal\":{\"name\":\"\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0,\"publicationDate\":\"2024-03-25\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"\",\"FirstCategoryId\":\"100\",\"ListUrlMain\":\"https://link.springer.com/article/10.1134/S1064562423701624\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"","FirstCategoryId":"100","ListUrlMain":"https://link.springer.com/article/10.1134/S1064562423701624","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Investigation of Neural Network Algorithms for Human Movement Prediction Based on LSTM and Transformers
The problem of predicting the position of a person on future frames of a video stream is solved, and in-depth experimental studies on the application of traditional and SOTA blocks for this task are carried out. An original architecture of KeyFNet and its modifications based on transform blocks is presented, which is able to predict coordinates in the video stream for 30, 60, 90, and 120 frames ahead with high accuracy. The novelty lies in the application of a combined algorithm based on multiple FNet blocks with fast Fourier transform as an attention mechanism concatenating the coordinates of key points. Experiments on Human3.6M and on our own real data confirmed the effectiveness of the proposed approach based on FNet blocks, compared to the traditional approach based on LSTM. The proposed algorithm matches the accuracy of advanced models, but outperforms them in terms of speed, uses less computational resources, and thus can be applied in collaborative robotic solutions.