FLAG3D++: A Benchmark for 3D Fitness Activity Comprehension With Language Instruction

Yansong Tang, Aoyang Liu, Jinpeng Liu, Shiyi Zhang, Wenxun Dai, Jie Zhou, Xiu Li, Jiwen Lu

IEEE Transactions on Pattern Analysis and Machine Intelligence
Published: 2025-07-17 · DOI: 10.1109/tpami.2025.3590012 (https://doi.org/10.1109/tpami.2025.3590012)
Abstract
Recent years have witnessed rapid progress in general human action understanding. However, when applied to real-world scenarios such as sports analysis, most existing datasets remain unsatisfactory because they lack rich labels for multiple tasks, language instructions, high-quality 3D data, and diverse environments. In this paper, we present FLAG3D++, a large-scale benchmark for 3D fitness activity comprehension that contains 180K sequences of 60 activity categories with language instructions. FLAG3D++ features the following four aspects: 1) fine-grained annotations of the temporal intervals of actions in untrimmed long sequences and of how well these actions are performed; 2) detailed, professional language instructions describing how to perform each activity; 3) accurate and dense 3D human poses captured by an advanced MoCap system to handle complex activities and large movements; and 4) versatile video resources from a high-tech MoCap system, rendering software, and cost-effective smartphones in natural environments. In light of these features, we present two new practical applications, language-guided repetition action counting (L-RAC) and language-guided action quality assessment (L-AQA), which take language descriptions as references to count the repetitions of an action and to assess the quality of an action, respectively. Furthermore, we propose a Hierarchical Language-Guided Graph Convolutional Network (HL-GCN) to better fuse language information with skeleton sequences for L-RAC and L-AQA. Specifically, HL-GCN performs cross-modal alignment through early fusion of the linguistic feature with the hierarchical node features of the skeleton sequences encoded by multiple intermediate graph convolutional layers. Extensive experiments show the superiority of HL-GCN on both L-RAC and L-AQA, as well as the research value of FLAG3D++ for various challenges such as dynamic human mesh recovery and cross-domain human action recognition. Our dataset, source code, and trained models are made publicly available at FLAG3D++.
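To make the early-fusion idea in the abstract concrete, the sketch below shows one plausible way to inject a sentence embedding into the intermediate layers of a skeleton GCN. This is a minimal illustration, not the authors' HL-GCN: the module names, dimensions, and the additive fusion are all assumptions introduced here for exposition; the paper itself defines the actual architecture.

```python
# Illustrative sketch only (PyTorch). All names, dimensions, and the
# fusion-by-addition choice are assumptions, NOT the authors' HL-GCN.
import torch
import torch.nn as nn


class GCNLayer(nn.Module):
    """One graph convolution over skeleton joints: X' = ReLU((A X) W)."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: (batch, joints, in_dim); adj: (joints, joints), row-normalized.
        return torch.relu(self.linear(adj @ x))


class LanguageGuidedGCN(nn.Module):
    """Early fusion: a sentence embedding is added into every intermediate layer."""

    def __init__(self, joint_dim: int, lang_dim: int, hidden: int, layers: int = 3):
        super().__init__()
        dims = [joint_dim] + [hidden] * layers
        self.gcn_layers = nn.ModuleList(
            GCNLayer(dims[i], dims[i + 1]) for i in range(layers)
        )
        # One projection per layer maps the language feature into that
        # layer's node-feature space so the two can be summed.
        self.lang_proj = nn.ModuleList(
            nn.Linear(lang_dim, hidden) for _ in range(layers)
        )
        self.head = nn.Linear(hidden, 1)  # scalar output, e.g. a quality score

    def forward(self, joints, adj, lang):
        # joints: (batch, joints, joint_dim); lang: (batch, lang_dim).
        x = joints
        for gcn, proj in zip(self.gcn_layers, self.lang_proj):
            x = gcn(x, adj)
            # Broadcast the projected sentence feature to all joint nodes.
            x = x + proj(lang).unsqueeze(1)
        return self.head(x.mean(dim=1))  # pool over joints


# Toy usage on a 17-joint skeleton with random features.
if __name__ == "__main__":
    batch, num_joints = 2, 17
    adj = torch.eye(num_joints)  # stand-in for a normalized skeleton adjacency
    model = LanguageGuidedGCN(joint_dim=3, lang_dim=768, hidden=64)
    score = model(torch.randn(batch, num_joints, 3), adj, torch.randn(batch, 768))
    print(score.shape)  # torch.Size([2, 1])
```

The point of fusing at every intermediate layer, rather than only at the input or output, is that the language signal can then modulate both low-level joint features and higher-level, more abstract motion features, which matches the "hierarchical" cross-modal alignment the abstract describes.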
Journal introduction:
The IEEE Transactions on Pattern Analysis and Machine Intelligence publishes articles on all traditional areas of computer vision and image understanding, all traditional areas of pattern analysis and recognition, and selected areas of machine intelligence, with a particular emphasis on machine learning for pattern analysis. Areas such as techniques for visual search, document and handwriting analysis, medical image analysis, video and image sequence analysis, content-based retrieval of image and video, face and gesture recognition and relevant specialized hardware and/or software architectures are also covered.