Frontiers in Neurorobotics: Latest Publications

Multimodal robot-assisted English writing guidance and error correction with reinforcement learning.
IF 2.6 | Zone 4 | Computer Science
Frontiers in Neurorobotics Pub Date: 2024-11-20 eCollection Date: 2024-01-01 DOI: 10.3389/fnbot.2024.1483131
Ni Wang
{"title":"Multimodal robot-assisted English writing guidance and error correction with reinforcement learning.","authors":"Ni Wang","doi":"10.3389/fnbot.2024.1483131","DOIUrl":"10.3389/fnbot.2024.1483131","url":null,"abstract":"<p><strong>Introduction: </strong>With the development of globalization and the increasing importance of English in international communication, effectively improving English writing skills has become a key focus in language learning. Traditional methods for English writing guidance and error correction have predominantly relied on rule-based approaches or statistical models, such as conventional language models and basic machine learning algorithms. While these methods can aid learners in improving their writing quality to some extent, they often suffer from limitations such as inflexibility, insufficient contextual understanding, and an inability to handle multimodal information. These shortcomings restrict their effectiveness in more complex linguistic environments.</p><p><strong>Methods: </strong>To address these challenges, this study introduces ETG-ALtrans, a multimodal robot-assisted English writing guidance and error correction technology based on an improved ALBEF model and VGG19 architecture, enhanced by reinforcement learning. The approach leverages VGG19 to extract visual features and integrates them with the ALBEF model, achieving precise alignment and fusion of images and text. This enhances the model's ability to comprehend context. Furthermore, by incorporating reinforcement learning, the model can adaptively refine its correction strategies, thereby optimizing the effectiveness of writing guidance.</p><p><strong>Results and discussion: </strong>Experimental results demonstrate that the proposed ETG-ALtrans method significantly improves the accuracy of English writing error correction and the intelligence level of writing guidance in multimodal data scenarios. Compared to traditional methods, this approach not only enhances the precision of writing suggestions but also better caters to the personalized needs of learners, thereby effectively improving their writing skills. This research is of significant importance in the field of language learning technology and offers new perspectives and methodologies for the development of future English writing assistance tools.</p>","PeriodicalId":12628,"journal":{"name":"Frontiers in Neurorobotics","volume":"18 ","pages":"1483131"},"PeriodicalIF":2.6,"publicationDate":"2024-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11614782/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142779207","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
ISFM-SLAM: dynamic visual SLAM with instance segmentation and feature matching.
IF 2.6 | Zone 4 | Computer Science
Frontiers in Neurorobotics Pub Date: 2024-11-20 eCollection Date: 2024-01-01 DOI: 10.3389/fnbot.2024.1473937
Chao Li, Yang Hu, Jianqiang Liu, Jianhai Jin, Jun Sun
{"title":"ISFM-SLAM: dynamic visual SLAM with instance segmentation and feature matching.","authors":"Chao Li, Yang Hu, Jianqiang Liu, Jianhai Jin, Jun Sun","doi":"10.3389/fnbot.2024.1473937","DOIUrl":"10.3389/fnbot.2024.1473937","url":null,"abstract":"<p><strong>Introduction: </strong>Simultaneous Localization and Mapping (SLAM) is a technology used in intelligent systems such as robots and autonomous vehicles. Visual SLAM has become a more popular type of SLAM due to its acceptable cost and good scalability when applied in robot positioning, navigation and other functions. However, most of the visual SLAM algorithms assume a static environment, so when they are implemented in highly dynamic scenes, problems such as tracking failure and overlapped mapping are prone to occur.</p><p><strong>Methods: </strong>To deal with this issue, we propose ISFM-SLAM, a dynamic visual SLAM built upon the classic ORB-SLAM2, incorporating an improved instance segmentation network and enhanced feature matching. Based on YOLACT, the improved instance segmentation network applies the multi-scale residual network Res2Net as its backbone, and utilizes CIoU_Loss in the bounding box loss function, to enhance the detection accuracy of the segmentation network. To improve the matching rate and calculation efficiency of the internal feature points, we fuse ORB key points with an efficient image descriptor to replace traditional ORB feature matching of ORB-SLAM2. Moreover, the motion consistency detection algorithm based on external variance values is proposed and integrated into ISFM-SLAM, to assist the proposed SLAM systems in culling dynamic feature points more effectively.</p><p><strong>Results and discussion: </strong>Simulation results on the TUM dataset show that the overall pose estimation accuracy of the ISFM-SLAM is 97% better than the ORB-SLAM2, and is superior to other mainstream and state-of-the-art dynamic SLAM systems. Further real-world experiments validate the feasibility of the proposed SLAM system in practical applications.</p>","PeriodicalId":12628,"journal":{"name":"Frontiers in Neurorobotics","volume":"18 ","pages":"1473937"},"PeriodicalIF":2.6,"publicationDate":"2024-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11615477/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142779015","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Learning-based object's stiffness and shape estimation with confidence level in multi-fingered hand grasping.
IF 2.6 | Zone 4 | Computer Science
Frontiers in Neurorobotics Pub Date: 2024-11-19 eCollection Date: 2024-01-01 DOI: 10.3389/fnbot.2024.1466630
Kyo Kutsuzawa, Minami Matsumoto, Dai Owaki, Mitsuhiro Hayashibe
{"title":"Learning-based object's stiffness and shape estimation with confidence level in multi-fingered hand grasping.","authors":"Kyo Kutsuzawa, Minami Matsumoto, Dai Owaki, Mitsuhiro Hayashibe","doi":"10.3389/fnbot.2024.1466630","DOIUrl":"10.3389/fnbot.2024.1466630","url":null,"abstract":"<p><strong>Introduction: </strong>When humans grasp an object, they are capable of recognizing its characteristics, such as its stiffness and shape, through the sensation of their hands. They can also determine their level of confidence in the estimated object properties. In this study, we developed a method for multi-fingered hands to estimate both physical and geometric properties, such as the stiffness and shape of an object. Their confidence levels were measured using proprioceptive signals, such as joint angles and velocity.</p><p><strong>Method: </strong>We have developed a learning framework based on probabilistic inference that does not necessitate hyperparameters to maintain equilibrium between the estimation of diverse types of properties. Using this framework, we have implemented recurrent neural networks that estimate the stiffness and shape of grasped objects with their uncertainty in real time.</p><p><strong>Results: </strong>We demonstrated that the trained neural networks are capable of representing the confidence level of estimation that includes the degree of uncertainty and task difficulty in the form of variance and entropy.</p><p><strong>Discussion: </strong>We believe that this approach will contribute to reliable state estimation. Our approach would also be able to combine with flexible object manipulation and probabilistic inference-based decision making.</p>","PeriodicalId":12628,"journal":{"name":"Frontiers in Neurorobotics","volume":"18 ","pages":"1466630"},"PeriodicalIF":2.6,"publicationDate":"2024-11-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11611863/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142768248","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Multimodal fusion-powered English speaking robot.
IF 2.6 | Zone 4 | Computer Science
Frontiers in Neurorobotics Pub Date: 2024-11-15 eCollection Date: 2024-01-01 DOI: 10.3389/fnbot.2024.1478181
Ruiying Pan
{"title":"Multimodal fusion-powered English speaking robot.","authors":"Ruiying Pan","doi":"10.3389/fnbot.2024.1478181","DOIUrl":"https://doi.org/10.3389/fnbot.2024.1478181","url":null,"abstract":"<p><strong>Introduction: </strong>Speech recognition and multimodal learning are two critical areas in machine learning. Current multimodal speech recognition systems often encounter challenges such as high computational demands and model complexity.</p><p><strong>Methods: </strong>To overcome these issues, we propose a novel framework-EnglishAL-Net, a Multimodal Fusion-powered English Speaking Robot. This framework leverages the ALBEF model, optimizing it for real-time speech and multimodal interaction, and incorporates a newly designed text and image editor to fuse visual and textual information. The robot processes dynamic spoken input through the integration of Neural Machine Translation (NMT), enhancing its ability to understand and respond to spoken language.</p><p><strong>Results and discussion: </strong>In the experimental section, we constructed a dataset containing various scenarios and oral instructions for testing. The results show that compared to traditional unimodal processing methods, our model significantly improves both language understanding accuracy and response time. This research not only enhances the performance of multimodal interaction in robots but also opens up new possibilities for applications of robotic technology in education, rescue, customer service, and other fields, holding significant theoretical and practical value.</p>","PeriodicalId":12628,"journal":{"name":"Frontiers in Neurorobotics","volume":"18 ","pages":"1478181"},"PeriodicalIF":2.6,"publicationDate":"2024-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11604748/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142768250","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
UnionCAM: enhancing CNN interpretability through denoising, weighted fusion, and selective high-quality class activation mapping.
IF 2.6 | Zone 4 | Computer Science
Frontiers in Neurorobotics Pub Date: 2024-11-14 eCollection Date: 2024-01-01 DOI: 10.3389/fnbot.2024.1490198
Hao Hu, Rui Wang, Hao Lin, Huai Yu
{"title":"UnionCAM: enhancing CNN interpretability through denoising, weighted fusion, and selective high-quality class activation mapping.","authors":"Hao Hu, Rui Wang, Hao Lin, Huai Yu","doi":"10.3389/fnbot.2024.1490198","DOIUrl":"10.3389/fnbot.2024.1490198","url":null,"abstract":"<p><p>Deep convolutional neural networks (CNNs) have achieved remarkable success in various computer vision tasks. However, the lack of interpretability in these models has raised concerns and hindered their widespread adoption in critical domains. Generating activation maps that highlight the regions contributing to the CNN's decision has emerged as a popular approach to visualize and interpret these models. Nevertheless, existing methods often produce activation maps contaminated with irrelevant background noise or incomplete object activation, limiting their effectiveness in providing meaningful explanations. To address this challenge, we propose Union Class Activation Mapping (UnionCAM), an innovative visual interpretation framework that generates high-quality class activation maps (CAMs) through a novel three-step approach. UnionCAM introduces a weighted fusion strategy that adaptively combines multiple CAMs to create more informative and comprehensive activation maps. First, the denoising module removes background noise from CAMs by using adaptive thresholding. Subsequently, the union module fuses the denoised CAMs with region-based CAMs using a weighted combination scheme to obtain more comprehensive and informative maps, which we refer to as fused CAMs. Lastly, the activation map selection module automatically selects the optimal CAM that offers the best interpretation from the pool of fused CAMs. Extensive experiments on ILSVRC2012 and VOC2007 datasets demonstrate UnionCAM's superior performance over state-of-the-art methods. It effectively suppresses background noise, captures complete object regions, and provides intuitive visual explanations. UnionCAM achieves significant improvements in insertion and deletion scores, outperforming the best baseline. UnionCAM makes notable contributions by introducing a novel denoising strategy, adaptive fusion of CAMs, and an automatic selection mechanism. It bridges the gap between CNN performance and interpretability, providing a valuable tool for understanding and trusting CNN-based systems. UnionCAM has the potential to foster responsible deployment of CNNs in real-world applications.</p>","PeriodicalId":12628,"journal":{"name":"Frontiers in Neurorobotics","volume":"18 ","pages":"1490198"},"PeriodicalIF":2.6,"publicationDate":"2024-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11602493/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142750018","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Real-time fault detection for IIoT facilities using GA-Att-LSTM based on edge-cloud collaboration.
IF 2.6 | Zone 4 | Computer Science
Frontiers in Neurorobotics Pub Date: 2024-11-11 eCollection Date: 2024-01-01 DOI: 10.3389/fnbot.2024.1499703
Jiuling Dong, Zehui Li, Yuanshuo Zheng, Jingtang Luo, Min Zhang, Xiaolong Yang
{"title":"Real-time fault detection for IIoT facilities using GA-Att-LSTM based on edge-cloud collaboration.","authors":"Jiuling Dong, Zehui Li, Yuanshuo Zheng, Jingtang Luo, Min Zhang, Xiaolong Yang","doi":"10.3389/fnbot.2024.1499703","DOIUrl":"10.3389/fnbot.2024.1499703","url":null,"abstract":"<p><p>With the rapid development of Industrial Internet of Things (IIoT) technology, various IIoT devices are generating large amounts of industrial sensor data that are spatiotemporally correlated and heterogeneous from multi-source and multi-domain. This poses a challenge to current detection algorithms. Therefore, this paper proposes an improved long short-term memory (LSTM) neural network model based on the genetic algorithm, attention mechanism and edge-cloud collaboration (GA-Att-LSTM) framework is proposed to detect anomalies of IIoT facilities. Firstly, an edge-cloud collaboration framework is established to real-time process a large amount of sensor data at the edge node in real time, which reduces the time of uploading sensor data to the cloud platform. Secondly, to overcome the problem of insufficient attention to important features in the input sequence in traditional LSTM algorithms, we introduce an attention mechanism to adaptively adjust the weights of important features in the model. Meanwhile, a genetic algorithm optimized hyperparameters of the LSTM neural network is proposed to transform anomaly detection into a classification problem and effectively extract the correlation of time-series data, which improves the recognition rate of fault detection. Finally, the proposed method has been evaluated on a publicly available fault database. The results indicate an accuracy of 99.6%, an F1-score of 84.2%, a precision of 89.8%, and a recall of 77.6%, all of which exceed the performance of five traditional machine learning methods.</p>","PeriodicalId":12628,"journal":{"name":"Frontiers in Neurorobotics","volume":"18 ","pages":"1499703"},"PeriodicalIF":2.6,"publicationDate":"2024-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11586361/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142716239","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Real-time location of acupuncture points based on anatomical landmarks and pose estimation models.
IF 2.6 | Zone 4 | Computer Science
Frontiers in Neurorobotics Pub Date: 2024-11-08 eCollection Date: 2024-01-01 DOI: 10.3389/fnbot.2024.1484038
Hadi Sedigh Malekroodi, Seon-Deok Seo, Jinseong Choi, Chang-Soo Na, Byeong-Il Lee, Myunggi Yi
{"title":"Real-time location of acupuncture points based on anatomical landmarks and pose estimation models.","authors":"Hadi Sedigh Malekroodi, Seon-Deok Seo, Jinseong Choi, Chang-Soo Na, Byeong-Il Lee, Myunggi Yi","doi":"10.3389/fnbot.2024.1484038","DOIUrl":"https://doi.org/10.3389/fnbot.2024.1484038","url":null,"abstract":"<p><strong>Introduction: </strong>Precise identification of acupuncture points (acupoints) is essential for effective treatment, but manual location by untrained individuals can often lack accuracy and consistency. This study proposes two approaches that use artificial intelligence (AI) specifically computer vision to automatically and accurately identify acupoints on the face and hand in real-time, enhancing both precision and accessibility in acupuncture practices.</p><p><strong>Methods: </strong>The first approach applies a real-time landmark detection system to locate 38 specific acupoints on the face and hand by translating anatomical landmarks from image data into acupoint coordinates. The second approach uses a convolutional neural network (CNN) specifically optimized for pose estimation to detect five key acupoints on the arm and hand (LI11, LI10, TE5, TE3, LI4), drawing on constrained medical imaging data for training. To validate these methods, we compared the predicted acupoint locations with those annotated by experts.</p><p><strong>Results: </strong>Both approaches demonstrated high accuracy, with mean localization errors of less than 5 mm when compared to expert annotations. The landmark detection system successfully mapped multiple acupoints across the face and hand even in complex imaging scenarios. The data-driven approach accurately detected five arm and hand acupoints with a mean Average Precision (mAP) of 0.99 at OKS 50%.</p><p><strong>Discussion: </strong>These AI-driven methods establish a solid foundation for the automated localization of acupoints, enhancing both self-guided and professional acupuncture practices. By enabling precise, real-time localization of acupoints, these technologies could improve the accuracy of treatments, facilitate self-training, and increase the accessibility of acupuncture. Future developments could expand these models to include additional acupoints and incorporate them into intuitive applications for broader use.</p>","PeriodicalId":12628,"journal":{"name":"Frontiers in Neurorobotics","volume":"18 ","pages":"1484038"},"PeriodicalIF":2.6,"publicationDate":"2024-11-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11609928/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142768252","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Vahagn: VisuAl Haptic Attention Gate Net for slip detection.
IF 2.6 | Zone 4 | Computer Science
Frontiers in Neurorobotics Pub Date: 2024-11-06 eCollection Date: 2024-01-01 DOI: 10.3389/fnbot.2024.1484751
Jinlin Wang, Yulong Ji, Hongyu Yang
{"title":"Vahagn: VisuAl Haptic Attention Gate Net for slip detection.","authors":"Jinlin Wang, Yulong Ji, Hongyu Yang","doi":"10.3389/fnbot.2024.1484751","DOIUrl":"10.3389/fnbot.2024.1484751","url":null,"abstract":"<p><strong>Introduction: </strong>Slip detection is crucial for achieving stable grasping and subsequent operational tasks. A grasp action is a continuous process that requires information from multiple sources. The success of a specific grasping maneuver is contingent upon the confluence of two factors: the spatial accuracy of the contact and the stability of the continuous process.</p><p><strong>Methods: </strong>In this paper, for the task of perceiving grasping results using visual-haptic information, we propose a new method for slip detection, which synergizes visual and haptic information from spatial-temporal dual dimensions. Specifically, the method takes as input a sequence of visual images from a first-person perspective and a sequence of haptic images from a gripper. Then, it extracts time-dependent features of the whole process and spatial features matching the importance of different parts with different attention mechanisms. Inspired by neurological studies, during the information fusion process, we adjusted temporal and spatial information from vision and haptic through a combination of two-step fusion and gate units.</p><p><strong>Results and discussion: </strong>To validate the effectiveness of method, we compared it with traditional CNN net and models with attention. It is anticipated that our method achieves a classification accuracy of 93.59%, which is higher than that of previous works. Attention visualization is further presented to support the validity.</p>","PeriodicalId":12628,"journal":{"name":"Frontiers in Neurorobotics","volume":"18 ","pages":"1484751"},"PeriodicalIF":2.6,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11576469/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142681508","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
A multimodal educational robot driven via dynamic attention.
IF 2.6 | Zone 4 | Computer Science
Frontiers in Neurorobotics Pub Date: 2024-10-31 eCollection Date: 2024-01-01 DOI: 10.3389/fnbot.2024.1453061
An Jianliang
{"title":"A multimodal educational robots driven via dynamic attention.","authors":"An Jianliang","doi":"10.3389/fnbot.2024.1453061","DOIUrl":"10.3389/fnbot.2024.1453061","url":null,"abstract":"<p><strong>Introduction: </strong>With the development of artificial intelligence and robotics technology, the application of educational robots in teaching is becoming increasingly popular. However, effectively evaluating and optimizing multimodal educational robots remains a challenge.</p><p><strong>Methods: </strong>This study introduces Res-ALBEF, a multimodal educational robot framework driven by dynamic attention. Res-ALBEF enhances the ALBEF (Align Before Fuse) method by incorporating residual connections to align visual and textual data more effectively before fusion. In addition, the model integrates a VGG19-based convolutional network for image feature extraction and utilizes a dynamic attention mechanism to dynamically focus on relevant parts of multimodal inputs. Our model was trained using a diverse dataset consisting of 50,000 multimodal educational instances, covering a variety of subjects and instructional content.</p><p><strong>Results and discussion: </strong>The evaluation on an independent validation set of 10,000 samples demonstrated significant performance improvements: the model achieved an overall accuracy of 97.38% in educational content recognition. These results highlight the model's ability to improve alignment and fusion of multimodal information, making it a robust solution for multimodal educational robots.</p>","PeriodicalId":12628,"journal":{"name":"Frontiers in Neurorobotics","volume":"18 ","pages":"1453061"},"PeriodicalIF":2.6,"publicationDate":"2024-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11560911/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142618559","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
LS-VIT: Vision Transformer for action recognition based on long and short-term temporal difference.
IF 2.6 | Zone 4 | Computer Science
Frontiers in Neurorobotics Pub Date: 2024-10-31 eCollection Date: 2024-01-01 DOI: 10.3389/fnbot.2024.1457843
Dong Chen, Peisong Wu, Mingdong Chen, Mengtao Wu, Tao Zhang, Chuanqi Li
{"title":"LS-VIT: Vision Transformer for action recognition based on long and short-term temporal difference.","authors":"Dong Chen, Peisong Wu, Mingdong Chen, Mengtao Wu, Tao Zhang, Chuanqi Li","doi":"10.3389/fnbot.2024.1457843","DOIUrl":"10.3389/fnbot.2024.1457843","url":null,"abstract":"<p><p>Over the past few years, a growing number of researchers have dedicated their efforts to focusing on temporal modeling. The advent of transformer-based methods has notably advanced the field of 2D image-based vision tasks. However, with respect to 3D video tasks such as action recognition, applying temporal transformations directly to video data significantly increases both computational and memory demands. This surge in resource consumption is due to the multiplication of data patches and the added complexity of self-aware computations. Accordingly, building efficient and precise 3D self-attentive models for video content represents as a major challenge for transformers. In our research, we introduce an Long and Short-term Temporal Difference Vision Transformer (LS-VIT). This method incorporates short-term motion details into images by weighting the difference across several consecutive frames, thereby equipping the original image with the ability to model short-term motions. Concurrently, we integrate a module designed to understand long-term motion details. This module enhances the model's capacity for long-term motion modeling by directly integrating temporal differences from various segments via motion excitation. Our thorough analysis confirms that the LS-VIT achieves high recognition accuracy across multiple benchmarks (e.g., UCF101, HMDB51, Kinetics-400). These research results indicate that LS-VIT has the potential for further optimization, which can improve real-time performance and action prediction capabilities.</p>","PeriodicalId":12628,"journal":{"name":"Frontiers in Neurorobotics","volume":"18 ","pages":"1457843"},"PeriodicalIF":2.6,"publicationDate":"2024-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11560894/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142618575","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0