{"title":"Global Spatial-Temporal Information Encoder-Decoder Based Action Segmentation in Untrimmed Video","authors":"Yichao Liu;Yiyang Sun;Zhide Chen;Chen Feng;Kexin Zhu","doi":"10.26599/TST.2024.9010041","DOIUrl":null,"url":null,"abstract":"Action segmentation has made significant progress, but segmenting and recognizing actions from untrimmed long videos remains a challenging problem. Most state-of-the-art methods focus on designing models based on temporal convolution. However, the limitations of modeling long-term temporal dependencies and the inflexibility of temporal convolutions restrict the potential of these models. To address the issue of over-segmentation in existing action segmentation methods, which leads to classification errors and reduced segmentation quality, this paper proposes a global spatial-temporal information encoder-decoder based action segmentation method. The method proposed in this paper uses the global temporal information captured by refinement layer to assist the Encoder-Decoder (ED) structure in judging the action segmentation point more accurately and, at the same time, suppress the excessive segmentation phenomenon caused by the ED structure. The method proposed in this paper achieves 93% frame accuracy on the constructed real Tai Chi action dataset. The experimental results prove that this method can accurately and efficiently complete the long video action segmentation task.","PeriodicalId":48690,"journal":{"name":"Tsinghua Science and Technology","volume":"30 1","pages":"290-302"},"PeriodicalIF":6.6000,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10676351","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Tsinghua Science and Technology","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10676351/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"Multidisciplinary","Score":null,"Total":0}
引用次数: 0
Abstract
Action segmentation has made significant progress, but segmenting and recognizing actions from untrimmed long videos remains a challenging problem. Most state-of-the-art methods focus on designing models based on temporal convolution. However, the limitations of modeling long-term temporal dependencies and the inflexibility of temporal convolutions restrict the potential of these models. To address the issue of over-segmentation in existing action segmentation methods, which leads to classification errors and reduced segmentation quality, this paper proposes a global spatial-temporal information encoder-decoder based action segmentation method. The method proposed in this paper uses the global temporal information captured by refinement layer to assist the Encoder-Decoder (ED) structure in judging the action segmentation point more accurately and, at the same time, suppress the excessive segmentation phenomenon caused by the ED structure. The method proposed in this paper achieves 93% frame accuracy on the constructed real Tai Chi action dataset. The experimental results prove that this method can accurately and efficiently complete the long video action segmentation task.
动作分割技术已经取得了重大进展,但从未经剪辑的长视频中分割和识别动作仍然是一个具有挑战性的问题。大多数最先进的方法都侧重于设计基于时态卷积的模型。然而,长期时间依赖性建模的局限性和时间卷积的不灵活性限制了这些模型的潜力。为了解决现有动作分割方法中存在的过度分割问题,即导致分类错误和分割质量下降的问题,本文提出了一种基于全局时空信息编码器-解码器的动作分割方法。本文提出的方法利用细化层捕获的全局时空信息,辅助编码器-解码器(ED)结构更准确地判断动作分割点,同时抑制 ED 结构造成的过度分割现象。本文提出的方法在构建的真实太极拳动作数据集上实现了 93% 的帧准确率。实验结果证明,该方法可以准确高效地完成长视频动作分割任务。
期刊介绍:
Tsinghua Science and Technology (Tsinghua Sci Technol) started publication in 1996. It is an international academic journal sponsored by Tsinghua University and is published bimonthly. This journal aims at presenting the up-to-date scientific achievements in computer science, electronic engineering, and other IT fields. Contributions all over the world are welcome.