{"title":"AnimalRTPose: Faster cross-species real-time animal pose estimation","authors":"Xin Wu , Lianming Wang , Jipeng Huang","doi":"10.1016/j.neunet.2025.107685","DOIUrl":null,"url":null,"abstract":"<div><div>Recent advancements in computer vision have facilitated the development of sophisticated tools for analyzing complex animal behaviors, yet the diversity of animal morphology and environmental complexities present significant challenges to real-time animal pose estimation. To address these challenges, we introduce AnimalRTPose, a one-stage model designed for cross-species real-time animal pose estimation. At its core, AnimalRTPose leverages CSPNeXt<span><math><msup><mrow></mrow><mrow><mi>†</mi></mrow></msup></math></span>, a novel backbone network that integrates depthwise separable convolution with skip connections for high-frequency feature extraction, a channel attention mechanism (CAM) to enhance the fusion of high-frequency and low-frequency features, and spatial pyramid pooling (SPP) to capture multi-scale contextual information. This architecture enables robust feature representation across varying spatial resolutions, enhancing adaptability to diverse species and environments. Additionally, AnimalRTPose incorporates an efficient multi-scale feature fusion module that dynamically balances local detail and global structural consistency, ensuring high accuracy and robustness in pose estimation. Designed for scalability and versatility, AnimalRTPose supports single-animal, multi-animal, cross-species, and few-shot scenarios. Specifically, AnimalRTPose-N achieves 476 FPS on NVIDIA RTX 2080Ti, 769 FPS on NVIDIA RTX 3090, and 1111 FPS on NVIDIA A800, while demonstrating high throughput on edge devices with 196 FPS on the NVIDIA Jetson™ AGX Orin Developer Kit (275 TOPS, 15 W to 60 W), 77 FPS on the Raspberry Pi 5 with AI HAT+ (26 TOPS, 25 W), and 64 FPS on the Atlas 200I Developer Kit A2 (8 TOPS, 24 W), all with a 640 × 640 input resolution. These results surpass all existing one-stage models, showcasing its superior performance in real-time animal pose estimation. AnimalRTPose is thus highly applicable for scenarios requiring real-time animal behavior monitoring. Further details on the model configuration and dataset are available on the <span><span>AnimalRTPose</span><svg><path></path></svg></span> project website.</div></div>","PeriodicalId":49763,"journal":{"name":"Neural Networks","volume":"190 ","pages":"Article 107685"},"PeriodicalIF":6.3000,"publicationDate":"2025-06-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Neural Networks","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0893608025005659","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0
Abstract
Recent advancements in computer vision have facilitated the development of sophisticated tools for analyzing complex animal behaviors, yet the diversity of animal morphology and environmental complexities present significant challenges to real-time animal pose estimation. To address these challenges, we introduce AnimalRTPose, a one-stage model designed for cross-species real-time animal pose estimation. At its core, AnimalRTPose leverages CSPNeXt, a novel backbone network that integrates depthwise separable convolution with skip connections for high-frequency feature extraction, a channel attention mechanism (CAM) to enhance the fusion of high-frequency and low-frequency features, and spatial pyramid pooling (SPP) to capture multi-scale contextual information. This architecture enables robust feature representation across varying spatial resolutions, enhancing adaptability to diverse species and environments. Additionally, AnimalRTPose incorporates an efficient multi-scale feature fusion module that dynamically balances local detail and global structural consistency, ensuring high accuracy and robustness in pose estimation. Designed for scalability and versatility, AnimalRTPose supports single-animal, multi-animal, cross-species, and few-shot scenarios. Specifically, AnimalRTPose-N achieves 476 FPS on NVIDIA RTX 2080Ti, 769 FPS on NVIDIA RTX 3090, and 1111 FPS on NVIDIA A800, while demonstrating high throughput on edge devices with 196 FPS on the NVIDIA Jetson™ AGX Orin Developer Kit (275 TOPS, 15 W to 60 W), 77 FPS on the Raspberry Pi 5 with AI HAT+ (26 TOPS, 25 W), and 64 FPS on the Atlas 200I Developer Kit A2 (8 TOPS, 24 W), all with a 640 × 640 input resolution. These results surpass all existing one-stage models, showcasing its superior performance in real-time animal pose estimation. AnimalRTPose is thus highly applicable for scenarios requiring real-time animal behavior monitoring. Further details on the model configuration and dataset are available on the AnimalRTPose project website.
期刊介绍:
Neural Networks is a platform that aims to foster an international community of scholars and practitioners interested in neural networks, deep learning, and other approaches to artificial intelligence and machine learning. Our journal invites submissions covering various aspects of neural networks research, from computational neuroscience and cognitive modeling to mathematical analyses and engineering applications. By providing a forum for interdisciplinary discussions between biology and technology, we aim to encourage the development of biologically-inspired artificial intelligence.