This review examines animal action recognition, focusing on coarse-grained (CG) and fine-grained (FG) techniques. Its primary aim is to survey the current state of research in animal behaviour recognition and to elucidate the challenges of recognising subtle animal actions in outdoor environments. Although the field is inspired by progress in human action recognition, it differs significantly from that domain: non-rigid body structures, frequent occlusions, high intra-species variability, complex environmental interactions, and datasets that are unstructured and lack large-scale annotation pose problems that human-centric models cannot fully address. Recent multimodal frameworks such as ARTEMIS and MSQNet exemplify state-of-the-art progress by integrating textual cues derived from video with visual and audio modalities. Considered alongside established spatio-temporal architectures such as SlowFast, these developments signal a shift toward richer multimodal paradigms in behaviour analysis. By assessing the strengths and weaknesses of current methodologies and introducing a recently published dataset, the review outlines future directions for fine-grained action recognition, with the aim of improving accuracy and generalisability in behaviour analysis across species. It extends earlier reviews by offering the first systematic treatment of CG and FG action recognition in animals.



