Weakly-Supervised Action Segmentation and Alignment via Transcript-Aware Union-of-Subspaces Learning

Zijia Lu, Ehsan Elhamifar
{"title":"基于文本感知子空间联合学习的弱监督动作分割和对齐","authors":"Zijia Lu, Ehsan Elhamifar","doi":"10.1109/ICCV48922.2021.00798","DOIUrl":null,"url":null,"abstract":"We address the problem of learning to segment actions from weakly-annotated videos, i.e., videos accompanied by transcripts (ordered list of actions). We propose a framework in which we model actions with a union of low-dimensional subspaces, learn the subspaces using transcripts and refine video features that lend themselves to action subspaces. To do so, we design an architecture consisting of a Union-of-Subspaces Network, which is an ensemble of autoencoders, each modeling a low-dimensional action subspace and can capture variations of an action within and across videos. For learning, at each iteration, we generate positive and negative soft alignment matrices using the segmentations from the previous iteration, which we use for discriminative training of our model. To regularize the learning, we introduce a constraint loss that prevents imbalanced segmentations and enforces relatively similar duration of each action across videos. To have a real-time inference, we develop a hierarchical segmentation framework that uses subset selection to find representative transcripts and hierarchically align a test video with increasingly refined representative transcripts. Our experiments on three datasets show that our method improves the state-of-the-art action segmentation and alignment, while speeding up the inference time by a factor of 4 to 13. 1","PeriodicalId":6820,"journal":{"name":"2021 IEEE/CVF International Conference on Computer Vision (ICCV)","volume":"31 1","pages":"8065-8075"},"PeriodicalIF":0.0000,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"11","resultStr":"{\"title\":\"Weakly-Supervised Action Segmentation and Alignment via Transcript-Aware Union-of-Subspaces Learning\",\"authors\":\"Zijia Lu, Ehsan Elhamifar\",\"doi\":\"10.1109/ICCV48922.2021.00798\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"We address the problem of learning to segment actions from weakly-annotated videos, i.e., videos accompanied by transcripts (ordered list of actions). We propose a framework in which we model actions with a union of low-dimensional subspaces, learn the subspaces using transcripts and refine video features that lend themselves to action subspaces. To do so, we design an architecture consisting of a Union-of-Subspaces Network, which is an ensemble of autoencoders, each modeling a low-dimensional action subspace and can capture variations of an action within and across videos. For learning, at each iteration, we generate positive and negative soft alignment matrices using the segmentations from the previous iteration, which we use for discriminative training of our model. To regularize the learning, we introduce a constraint loss that prevents imbalanced segmentations and enforces relatively similar duration of each action across videos. To have a real-time inference, we develop a hierarchical segmentation framework that uses subset selection to find representative transcripts and hierarchically align a test video with increasingly refined representative transcripts. Our experiments on three datasets show that our method improves the state-of-the-art action segmentation and alignment, while speeding up the inference time by a factor of 4 to 13. 
1\",\"PeriodicalId\":6820,\"journal\":{\"name\":\"2021 IEEE/CVF International Conference on Computer Vision (ICCV)\",\"volume\":\"31 1\",\"pages\":\"8065-8075\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-10-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"11\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2021 IEEE/CVF International Conference on Computer Vision (ICCV)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICCV48922.2021.00798\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 IEEE/CVF International Conference on Computer Vision (ICCV)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICCV48922.2021.00798","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 11

Abstract

We address the problem of learning to segment actions from weakly-annotated videos, i.e., videos accompanied by transcripts (ordered lists of actions). We propose a framework in which we model actions with a union of low-dimensional subspaces, learn the subspaces using transcripts and refine video features that lend themselves to action subspaces. To do so, we design an architecture consisting of a Union-of-Subspaces Network, which is an ensemble of autoencoders, each modeling a low-dimensional action subspace and capturing variations of an action within and across videos. For learning, at each iteration, we generate positive and negative soft alignment matrices using the segmentations from the previous iteration, which we use for discriminative training of our model. To regularize the learning, we introduce a constraint loss that prevents imbalanced segmentations and enforces relatively similar duration of each action across videos. To achieve real-time inference, we develop a hierarchical segmentation framework that uses subset selection to find representative transcripts and hierarchically align a test video with increasingly refined representative transcripts. Our experiments on three datasets show that our method improves the state-of-the-art action segmentation and alignment, while speeding up the inference time by a factor of 4 to 13.
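To make the core architectural idea concrete, below is a minimal sketch of a union-of-subspaces model as an ensemble of per-action autoencoders. This is an illustrative approximation, not the authors' released implementation: the class names, the choice of linear encoder/decoder, and all dimensions (`feat_dim`, `sub_dim`, `num_actions`) are assumptions for the example; a frame feature is scored by how well each action's low-dimensional subspace reconstructs it.

```python
# Minimal sketch (PyTorch) of a union-of-subspaces model as an ensemble
# of linear autoencoders. All names and dimensions are illustrative
# assumptions, not the paper's implementation.
import torch
import torch.nn as nn


class SubspaceAutoencoder(nn.Module):
    """One low-dimensional action subspace: a frame feature close to the
    subspace is reconstructed with small error."""

    def __init__(self, feat_dim: int, sub_dim: int):
        super().__init__()
        self.encoder = nn.Linear(feat_dim, sub_dim, bias=False)
        self.decoder = nn.Linear(sub_dim, feat_dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x))


class UnionOfSubspaces(nn.Module):
    """Ensemble of per-action autoencoders; scores each frame by the
    negative reconstruction error under every action subspace."""

    def __init__(self, num_actions: int, feat_dim: int, sub_dim: int):
        super().__init__()
        self.autoencoders = nn.ModuleList(
            [SubspaceAutoencoder(feat_dim, sub_dim) for _ in range(num_actions)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_frames, feat_dim) -> errors: (num_frames, num_actions)
        errors = torch.stack(
            [((ae(x) - x) ** 2).sum(dim=-1) for ae in self.autoencoders],
            dim=-1,
        )
        return -errors  # higher score = frame lies closer to that subspace


# Usage with assumed sizes: 64-dim frame features, 10 actions, 8-dim subspaces.
model = UnionOfSubspaces(num_actions=10, feat_dim=64, sub_dim=8)
frames = torch.randn(100, 64)
action_scores = model(frames)        # (100, 10)
pred = action_scores.argmax(dim=-1)  # per-frame action hypothesis
```

In this sketch a frame is simply assigned to the action whose subspace reconstructs it best; the paper additionally learns the subspaces from transcripts via positive/negative soft alignment matrices and regularizes training with the duration-constraint loss, which are beyond this snippet.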