Towards a Geometric Understanding of Spatiotemporal Graph Convolution Networks

IF 2.9 | Q2 | ENGINEERING, ELECTRICAL & ELECTRONIC
Pratyusha Das; Sarath Shekkizhar; Antonio Ortega
{"title":"实现对时空图卷积网络的几何理解","authors":"Pratyusha Das;Sarath Shekkizhar;Antonio Ortega","doi":"10.1109/OJSP.2024.3396635","DOIUrl":null,"url":null,"abstract":"Spatiotemporal graph convolutional networks (STGCNs) have emerged as a desirable model for \n<italic>skeleton</i>\n-based human action recognition. Despite achieving state-of-the-art performance, there is a limited understanding of the representations learned by these models, which hinders their application in critical and real-world settings. While layerwise analysis of CNN models has been studied in the literature, to the best of our knowledge, there exists \n<italic>no study</i>\n on the layerwise explainability of the embeddings learned on spatiotemporal data using STGCNs. In this paper, we first propose to use a local Dataset Graph (DS-Graph) obtained from the feature representation of input data at each layer to develop an understanding of the layer-wise embedding geometry of the STGCN. To do so, we develop a window-based dynamic time warping (DTW) method to compute the distance between data sequences with varying temporal lengths. To validate our findings, we have developed a layer-specific Spatiotemporal Graph Gradient-weighted Class Activation Mapping (L-STG-GradCAM) technique tailored for spatiotemporal data. This approach enables us to visually analyze and interpret each layer within the STGCN network. We characterize the functions learned by each layer of the STGCN using the label smoothness of the representation and visualize them using our L-STG-GradCAM approach. Our proposed method is generic and can yield valuable insights for STGCN architectures in different applications. However, this paper focuses on the human activity recognition task as a representative application. Our experiments show that STGCN models learn representations that capture general human motion in their initial layers while discriminating different actions only in later layers. This justifies experimental observations showing that fine-tuning deeper layers works well for transfer between related tasks. We provide experimental evidence for different human activity datasets and advanced spatiotemporal graph networks to validate that the proposed method is general enough to analyze any STGCN model and can be useful for drawing insight into networks in various scenarios. We also show that noise at the input has a limited effect on label smoothness, which can help justify the robustness of STGCNs to noise.","PeriodicalId":73300,"journal":{"name":"IEEE open journal of signal processing","volume":"5 ","pages":"1023-1030"},"PeriodicalIF":2.9000,"publicationDate":"2024-03-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10518107","citationCount":"0","resultStr":"{\"title\":\"Towards a Geometric Understanding of Spatiotemporal Graph Convolution Networks\",\"authors\":\"Pratyusha Das;Sarath Shekkizhar;Antonio Ortega\",\"doi\":\"10.1109/OJSP.2024.3396635\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Spatiotemporal graph convolutional networks (STGCNs) have emerged as a desirable model for \\n<italic>skeleton</i>\\n-based human action recognition. Despite achieving state-of-the-art performance, there is a limited understanding of the representations learned by these models, which hinders their application in critical and real-world settings. 
While layerwise analysis of CNN models has been studied in the literature, to the best of our knowledge, there exists \\n<italic>no study</i>\\n on the layerwise explainability of the embeddings learned on spatiotemporal data using STGCNs. In this paper, we first propose to use a local Dataset Graph (DS-Graph) obtained from the feature representation of input data at each layer to develop an understanding of the layer-wise embedding geometry of the STGCN. To do so, we develop a window-based dynamic time warping (DTW) method to compute the distance between data sequences with varying temporal lengths. To validate our findings, we have developed a layer-specific Spatiotemporal Graph Gradient-weighted Class Activation Mapping (L-STG-GradCAM) technique tailored for spatiotemporal data. This approach enables us to visually analyze and interpret each layer within the STGCN network. We characterize the functions learned by each layer of the STGCN using the label smoothness of the representation and visualize them using our L-STG-GradCAM approach. Our proposed method is generic and can yield valuable insights for STGCN architectures in different applications. However, this paper focuses on the human activity recognition task as a representative application. Our experiments show that STGCN models learn representations that capture general human motion in their initial layers while discriminating different actions only in later layers. This justifies experimental observations showing that fine-tuning deeper layers works well for transfer between related tasks. We provide experimental evidence for different human activity datasets and advanced spatiotemporal graph networks to validate that the proposed method is general enough to analyze any STGCN model and can be useful for drawing insight into networks in various scenarios. We also show that noise at the input has a limited effect on label smoothness, which can help justify the robustness of STGCNs to noise.\",\"PeriodicalId\":73300,\"journal\":{\"name\":\"IEEE open journal of signal processing\",\"volume\":\"5 \",\"pages\":\"1023-1030\"},\"PeriodicalIF\":2.9000,\"publicationDate\":\"2024-03-03\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10518107\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE open journal of signal processing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10518107/\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"ENGINEERING, ELECTRICAL & ELECTRONIC\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE open journal of signal processing","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10518107/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Citations: 0

Abstract

Spatiotemporal graph convolutional networks (STGCNs) have emerged as a desirable model for skeleton-based human action recognition. Despite achieving state-of-the-art performance, there is a limited understanding of the representations learned by these models, which hinders their application in critical and real-world settings. While layerwise analysis of CNN models has been studied in the literature, to the best of our knowledge, there exists no study on the layerwise explainability of the embeddings learned on spatiotemporal data using STGCNs. In this paper, we first propose to use a local Dataset Graph (DS-Graph) obtained from the feature representation of input data at each layer to develop an understanding of the layer-wise embedding geometry of the STGCN. To do so, we develop a window-based dynamic time warping (DTW) method to compute the distance between data sequences with varying temporal lengths. To validate our findings, we have developed a layer-specific Spatiotemporal Graph Gradient-weighted Class Activation Mapping (L-STG-GradCAM) technique tailored for spatiotemporal data. This approach enables us to visually analyze and interpret each layer within the STGCN network. We characterize the functions learned by each layer of the STGCN using the label smoothness of the representation and visualize them using our L-STG-GradCAM approach. Our proposed method is generic and can yield valuable insights for STGCN architectures in different applications. However, this paper focuses on the human activity recognition task as a representative application. Our experiments show that STGCN models learn representations that capture general human motion in their initial layers while discriminating different actions only in later layers. This justifies experimental observations showing that fine-tuning deeper layers works well for transfer between related tasks. We provide experimental evidence for different human activity datasets and advanced spatiotemporal graph networks to validate that the proposed method is general enough to analyze any STGCN model and can be useful for drawing insight into networks in various scenarios. We also show that noise at the input has a limited effect on label smoothness, which can help justify the robustness of STGCNs to noise.
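
The window-based DTW distance mentioned in the abstract is what allows sequences of different temporal lengths to be compared when building the DS-Graph. The paper's text here does not include code, so the sketch below is only a minimal illustration, assuming a Euclidean frame-wise cost and a Sakoe-Chiba-style band constraint; the function name `windowed_dtw` and the `window` parameter are hypothetical, and the authors' exact windowing scheme may differ.

```python
import numpy as np

def windowed_dtw(x, y, window=10):
    """Band-constrained DTW between two sequences of shape (T, F).

    Returns the cumulative alignment cost between x (Tx frames) and
    y (Ty frames). The band |i - j| <= window keeps the alignment
    locally constrained even when Tx != Ty.
    """
    tx, ty = len(x), len(y)
    window = max(window, abs(tx - ty))          # band must cover the length gap
    D = np.full((tx + 1, ty + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, tx + 1):
        lo, hi = max(1, i - window), min(ty, i + window)
        for j in range(lo, hi + 1):
            cost = np.linalg.norm(x[i - 1] - y[j - 1])  # frame-wise Euclidean cost
            D[i, j] = cost + min(D[i - 1, j],           # insertion
                                 D[i, j - 1],           # deletion
                                 D[i - 1, j - 1])       # match
    return D[tx, ty]
```

For skeleton data, each frame vector could be the flattened joint coordinates or a layer's per-frame features; pairwise `windowed_dtw` costs over all sequences would populate the distance matrix from which a DS-Graph is constructed.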
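The abstract also characterizes each layer by the label smoothness of its representation. As a rough illustration, one could build a k-nearest-neighbor DS-Graph from the pairwise DTW distances and evaluate the graph-Laplacian quadratic form of the one-hot labels; `knn_graph`, `label_smoothness`, the exponential edge-weight kernel, and the normalization below are all illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def knn_graph(dist, k=5):
    """Symmetric k-NN adjacency from a pairwise distance matrix (n, n)."""
    n = dist.shape[0]
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(dist[i])[1:k + 1]               # skip self (distance 0)
        W[i, nbrs] = np.exp(-dist[i, nbrs] / dist[i, nbrs].mean())
    return np.maximum(W, W.T)                             # symmetrize

def label_smoothness(W, labels):
    """Normalized Laplacian quadratic form of one-hot labels on the DS-Graph.

    For one-hot labels this reduces to the fraction of edge weight that
    crosses class boundaries: low values mean neighboring samples share
    a label, i.e., the layer's embedding separates the classes.
    """
    Y = np.eye(labels.max() + 1)[labels]                  # one-hot (n, C)
    L = np.diag(W.sum(axis=1)) - W                        # combinatorial Laplacian
    return np.trace(Y.T @ L @ Y) / W.sum()
```

Tracking this quantity across layers, from features at the input through each STGCN block, is one way to observe the trend the abstract reports: early layers encode generic motion (labels vary across neighbors), while class structure, and hence smoother labels on the DS-Graph, emerges only in later layers.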
Source journal: IEEE Open Journal of Signal Processing
CiteScore: 5.30
Self-citation rate: 0.00%
Articles published: 0
Review time: 22 weeks