Yushu Liu, Weigang Zhang, Guorong Li, Li Su, Qingming Huang
{"title":"One-Shot Example Videos Localization Network for Weakly-Supervised Temporal Action Localization","authors":"Yushu Liu, Weigang Zhang, Guorong Li, Li Su, Qingming Huang","doi":"10.1109/MIPR51284.2021.00026","DOIUrl":null,"url":null,"abstract":"This paper tackles the problem of example-driven weakly-supervised temporal action localization. We propose the One-shot Example Videos Localization Network (OSEVLNet) for precisely localizing the action instances in untrimmed videos with only one trimmed example video. Since the frame-level ground truth is unavailable under weakly-supervised settings, our approach automatically trains a self-attention module with reconstruction and feature discrepancy restriction. Specifically, the reconstruction restriction minimizes the discrepancy between the original input features and the reconstructed features of a Variational AutoEncoder (VAE) module. The feature discrepancy restriction maximizes the distance of weighted features between highly-responsive regions and slightly-responsive regions. Our approach achieves comparable or better results on THUMOS’14 dataset than other weakly-supervised methods while it is trained with much less videos. Moreover, our approach is especially suitable for the expansion of newly emerging action categories to meet the requirements of different occasions.","PeriodicalId":139543,"journal":{"name":"2021 IEEE 4th International Conference on Multimedia Information Processing and Retrieval (MIPR)","volume":"83 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 IEEE 4th International Conference on Multimedia Information Processing and Retrieval (MIPR)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/MIPR51284.2021.00026","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
Abstract
This paper tackles the problem of example-driven weakly-supervised temporal action localization. We propose the One-shot Example Videos Localization Network (OSEVLNet) for precisely localizing the action instances in untrimmed videos with only one trimmed example video. Since the frame-level ground truth is unavailable under weakly-supervised settings, our approach automatically trains a self-attention module with reconstruction and feature discrepancy restriction. Specifically, the reconstruction restriction minimizes the discrepancy between the original input features and the reconstructed features of a Variational AutoEncoder (VAE) module. The feature discrepancy restriction maximizes the distance of weighted features between highly-responsive regions and slightly-responsive regions. Our approach achieves comparable or better results on THUMOS’14 dataset than other weakly-supervised methods while it is trained with much less videos. Moreover, our approach is especially suitable for the expansion of newly emerging action categories to meet the requirements of different occasions.