FLAG3D++: A Benchmark for 3D Fitness Activity Comprehension With Language Instruction

Yansong Tang, Aoyang Liu, Jinpeng Liu, Shiyi Zhang, Wenxun Dai, Jie Zhou, Xiu Li, Jiwen Lu

IEEE Transactions on Pattern Analysis and Machine Intelligence
Published: 2025-07-17 · DOI: 10.1109/tpami.2025.3590012 (https://doi.org/10.1109/tpami.2025.3590012)
Abstract
Recent years have witnessed rapid progress in general human action understanding. However, when applied to real-world scenarios such as sports analysis, most existing datasets remain unsatisfactory because they lack rich labels for multiple tasks, language instructions, high-quality 3D data, and diverse environments. In this paper, we present FLAG3D++, a large-scale benchmark for 3D fitness activity comprehension that contains 180K sequences of 60 activity categories with language instructions. FLAG3D++ features the following four aspects: 1) fine-grained annotations of the temporal intervals of actions in untrimmed long sequences and of how well these actions are performed; 2) detailed, professional language instructions describing how to perform each activity; 3) accurate and dense 3D human poses captured by an advanced MoCap system to handle complex activities and large movements; and 4) versatile video resources from a high-tech MoCap system, rendering software, and cost-effective smartphones in natural environments. In light of these features, we present two new practical applications, language-guided repetition action counting (L-RAC) and language-guided action quality assessment (L-AQA), which take language descriptions as references to count the repetitions of an action and to assess the quality of an action, respectively. Furthermore, we propose a Hierarchical Language-Guided Graph Convolutional Network (HL-GCN) to better fuse language information with skeleton sequences for L-RAC and L-AQA. Specifically, HL-GCN performs cross-modal alignment through early fusion of the linguistic feature with the hierarchical node features of the skeleton sequences encoded by multiple intermediate graph convolutional layers. Extensive experiments show the superiority of HL-GCN on both L-RAC and L-AQA, as well as the research value of FLAG3D++ for various challenges such as dynamic human mesh recovery and cross-domain human action recognition. Our dataset, source code, and trained models are made publicly available at FLAG3D++.
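To make the early-fusion idea in the abstract concrete, the sketch below shows one plausible way to inject a sentence embedding into the intermediate layers of a skeleton GCN. This is a minimal illustration, not the authors' HL-GCN: the module names, dimensions, and the additive fusion are all assumptions introduced here for exposition; the paper itself defines the actual architecture.

```python
# Illustrative sketch only (PyTorch). All names, dimensions, and the
# fusion-by-addition choice are assumptions, NOT the authors' HL-GCN.
import torch
import torch.nn as nn


class GCNLayer(nn.Module):
    """One graph convolution over skeleton joints: X' = ReLU((A X) W)."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: (batch, joints, in_dim); adj: (joints, joints), row-normalized.
        return torch.relu(self.linear(adj @ x))


class LanguageGuidedGCN(nn.Module):
    """Early fusion: a sentence embedding is added into every intermediate layer."""

    def __init__(self, joint_dim: int, lang_dim: int, hidden: int, layers: int = 3):
        super().__init__()
        dims = [joint_dim] + [hidden] * layers
        self.gcn_layers = nn.ModuleList(
            GCNLayer(dims[i], dims[i + 1]) for i in range(layers)
        )
        # One projection per layer maps the language feature into that
        # layer's node-feature space so the two can be summed.
        self.lang_proj = nn.ModuleList(
            nn.Linear(lang_dim, hidden) for _ in range(layers)
        )
        self.head = nn.Linear(hidden, 1)  # scalar output, e.g. a quality score

    def forward(self, joints, adj, lang):
        # joints: (batch, joints, joint_dim); lang: (batch, lang_dim).
        x = joints
        for gcn, proj in zip(self.gcn_layers, self.lang_proj):
            x = gcn(x, adj)
            # Broadcast the projected sentence feature to all joint nodes.
            x = x + proj(lang).unsqueeze(1)
        return self.head(x.mean(dim=1))  # pool over joints


# Toy usage on a 17-joint skeleton with random features.
if __name__ == "__main__":
    batch, num_joints = 2, 17
    adj = torch.eye(num_joints)  # stand-in for a normalized skeleton adjacency
    model = LanguageGuidedGCN(joint_dim=3, lang_dim=768, hidden=64)
    score = model(torch.randn(batch, num_joints, 3), adj, torch.randn(batch, 768))
    print(score.shape)  # torch.Size([2, 1])
```

The point of fusing at every intermediate layer, rather than only at the input or output, is that the language signal can then modulate both low-level joint features and higher-level, more abstract motion features, which matches the "hierarchical" cross-modal alignment the abstract describes.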
Journal introduction:
The IEEE Transactions on Pattern Analysis and Machine Intelligence publishes articles on all traditional areas of computer vision and image understanding, all traditional areas of pattern analysis and recognition, and selected areas of machine intelligence, with a particular emphasis on machine learning for pattern analysis. Areas such as techniques for visual search, document and handwriting analysis, medical image analysis, video and image sequence analysis, content-based retrieval of image and video, face and gesture recognition and relevant specialized hardware and/or software architectures are also covered.