Spatial Temporal Aggregation for Efficient Continuous Sign Language Recognition

Impact Factor: 5.3 · CAS Partition: Tier 3 (Computer Science) · JCR: Q1 (Computer Science, Artificial Intelligence)
Lianyu Hu, Liqing Gao, Zekang Liu, Wei Feng
{"title":"Spatial Temporal Aggregation for Efficient Continuous Sign Language Recognition","authors":"Lianyu Hu;Liqing Gao;Zekang Liu;Wei Feng","doi":"10.1109/TETCI.2024.3378649","DOIUrl":null,"url":null,"abstract":"Despite the recent progress of continuous sign language recognition (CSLR), most state-of-the-art methods process input sign language videos frame by frame to predict sentences. This usually causes a heavy computational burden and is inefficient and even infeasible in real-world scenarios. Inspired by the fact that videos are inherently redundant where not all frames are essential for recognition, we propose spatial temporal aggregation (STAgg) to address this problem. Specifically, STAgg synthesizes adjacent similar frames into a unified robust representation before being fed into the recognition module, thus highly reducing the computation complexity and memory demand. We first give a detailed analysis on commonly-used aggregation methods like subsampling, max pooling and average, and then naturally derive our STAgg from the expected design criterion. Compared to commonly used pooling and subsampling counterparts, extensive ablation studies verify the superiority of our proposed three diverse STAgg variants in both accuracy and efficiency. The best version achieves comparative accuracy with state-of-the-art competitors, but is 1.35× faster with only 0.50× computational costs, consuming 0.70× training time and 0.65× memory usage. Experiments on four large-scale datasets upon multiple backbones fully verify the generalizability and effectiveness of the proposed STAgg. Another advantage of STAgg is enabling more powerful backbones, which may further boost the accuracy of CSLR under similar computational/memory budgets. We also visualize the results of STAgg to support intuitive and insightful analysis of the effects of STAgg.","PeriodicalId":13135,"journal":{"name":"IEEE Transactions on Emerging Topics in Computational Intelligence","volume":"8 6","pages":"3925-3935"},"PeriodicalIF":5.3000,"publicationDate":"2024-04-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Emerging Topics in Computational Intelligence","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10488467/","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

Despite recent progress in continuous sign language recognition (CSLR), most state-of-the-art methods process input sign language videos frame by frame to predict sentences. This imposes a heavy computational burden and is inefficient, or even infeasible, in real-world scenarios. Motivated by the observation that videos are inherently redundant and not all frames are essential for recognition, we propose spatial-temporal aggregation (STAgg) to address this problem. Specifically, STAgg synthesizes adjacent similar frames into a unified, robust representation before feeding them into the recognition module, greatly reducing computational complexity and memory demand. We first analyze commonly used aggregation methods such as subsampling, max pooling, and average pooling, and then derive STAgg from the resulting design criteria. Extensive ablation studies verify that our three diverse STAgg variants outperform these pooling and subsampling counterparts in both accuracy and efficiency. The best variant achieves accuracy comparable to state-of-the-art competitors while running 1.35× faster with only 0.50× the computational cost, 0.70× the training time, and 0.65× the memory usage. Experiments on four large-scale datasets across multiple backbones verify the generalizability and effectiveness of the proposed STAgg. Another advantage of STAgg is that it enables more powerful backbones, which may further boost CSLR accuracy under similar computational/memory budgets. We also visualize the results of STAgg to support intuitive and insightful analysis of its effects.
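To make the comparison in the abstract concrete, the sketch below implements the frame-aggregation baselines it names (subsampling, max pooling, average pooling) in PyTorch, plus a similarity-weighted aggregation standing in for STAgg's idea of merging adjacent similar frames into one representation. The paper's actual formulation is not given in this listing, so `agg_similarity_weighted` and every name here are illustrative assumptions, not the authors' method.

```python
# Minimal sketch of temporal aggregation strategies for a video
# tensor x of shape (T, C, H, W). Only the baselines are named in
# the abstract; the similarity-weighted variant is a hypothetical
# stand-in for STAgg, not the published algorithm.
import torch
import torch.nn.functional as F


def subsample(x: torch.Tensor, stride: int = 2) -> torch.Tensor:
    """Keep every `stride`-th frame; discards the rest outright."""
    return x[::stride]


def max_pool(x: torch.Tensor, window: int = 2) -> torch.Tensor:
    """Element-wise max over non-overlapping windows of frames."""
    T = (x.shape[0] // window) * window
    return x[:T].reshape(-1, window, *x.shape[1:]).amax(dim=1)


def average(x: torch.Tensor, window: int = 2) -> torch.Tensor:
    """Element-wise mean over non-overlapping windows of frames."""
    T = (x.shape[0] // window) * window
    return x[:T].reshape(-1, window, *x.shape[1:]).mean(dim=1)


def agg_similarity_weighted(x: torch.Tensor, window: int = 2) -> torch.Tensor:
    """Hypothetical aggregation in the spirit of STAgg: within each
    window, weight frames by cosine similarity to the window mean,
    so near-duplicate frames dominate and outliers are down-weighted."""
    T = (x.shape[0] // window) * window
    w = x[:T].reshape(-1, window, *x.shape[1:])    # (T', window, C, H, W)
    flat = w.flatten(start_dim=2)                  # (T', window, C*H*W)
    mean = flat.mean(dim=1, keepdim=True)          # (T', 1, C*H*W)
    sim = F.cosine_similarity(flat, mean, dim=-1)  # (T', window)
    weights = sim.softmax(dim=1).view(-1, window, 1, 1, 1)
    return (w * weights).sum(dim=1)                # (T', C, H, W)


if __name__ == "__main__":
    video = torch.randn(16, 3, 32, 32)  # toy clip: 16 frames
    for fn in (subsample, max_pool, average, agg_similarity_weighted):
        print(fn.__name__, tuple(fn(video).shape))  # each halves T to 8
```

Each function halves the temporal length of the clip, which is where the reported compute and memory savings would come from: the downstream recognition module processes half as many frames, at the cost of whatever information the aggregation discards.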
Source Journal
CiteScore: 10.30
Self-citation rate: 7.50%
Articles per year: 147
Journal description: The IEEE Transactions on Emerging Topics in Computational Intelligence (TETCI) publishes original articles on emerging aspects of computational intelligence, including theory, applications, and surveys. TETCI is an electronic-only publication and publishes six issues per year. Authors are encouraged to submit manuscripts on any emerging topic in computational intelligence, especially nature-inspired computing topics not covered by other IEEE Computational Intelligence Society journals. A few illustrative examples are glial cell networks, computational neuroscience, brain-computer interfaces, ambient intelligence, non-fuzzy computing with words, artificial life, cultural learning, artificial endocrine networks, social reasoning, artificial hormone networks, and computational intelligence for the IoT and Smart-X technologies.