Two Stream Dynamic Threshold Network for Weakly-Supervised Temporal Action Localization

Hao Yan, Jun Cheng, Qieshi Zhang, Ziliang Ren, Shijie Sun, Qin Cheng

2021 IEEE International Conference on Real-time Computing and Robotics (RCAR), July 15, 2021

DOI: 10.1109/RCAR52367.2021.9517513
Abstract
Current mainstream temporal action localization methods are fully supervised and therefore require substantial time to annotate frame-level labels. Weakly-supervised methods greatly alleviate this problem, as they need only video-level labels to train the models. To generate accurate action localization boundaries, the recent Two-Stream Consensus Network (TSCN) proposed an attention normalization loss that explicitly pushes attention values toward extreme values to avoid ambiguity. However, most previous methods, including TSCN, apply a fixed threshold in the attention loss to polarize the attention values, which lacks flexibility across different videos. In this paper, we propose a Dynamic Threshold Weakly-supervised action Localization (DH-WTAL) method to address this problem. DH-WTAL features a dynamic threshold decision for the attention mechanism: the dynamic threshold controls the number of snippets selected for each video, which in turn adjusts the extreme values of the attention mechanism accordingly. Extensive experiments demonstrate that DH-WTAL outperforms the TSCN baseline, and an ablation study validates the effectiveness of the method.
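The mechanism described above can be illustrated with a minimal sketch. The paper's exact formulation is not given in the abstract, so the following assumes a TSCN-style normalization loss (mean of the top-k attention values minus the mean of the bottom-k, negated so that minimizing it polarizes attentions) and uses a hypothetical per-video heuristic, `dynamic_ratio`, to stand in for the dynamic threshold decision:

```python
import numpy as np

def attention_norm_loss(attn, ratio):
    """TSCN-style attention normalization loss (sketch).

    attn:  (T,) per-snippet attention values in [0, 1].
    ratio: fraction of snippets forming the top/bottom sets.

    Minimizing the loss pushes the top-k attentions toward 1
    and the bottom-k toward 0, polarizing the attention curve.
    """
    T = len(attn)
    k = max(1, int(T * ratio))          # number of snippets selected
    s = np.sort(attn)
    top_mean = s[-k:].mean()            # most-attended snippets
    bot_mean = s[:k].mean()             # least-attended snippets
    return -(top_mean - bot_mean)

def dynamic_ratio(attn, base=0.125):
    """Hypothetical dynamic threshold: scale the selection ratio by the
    fraction of snippets whose attention exceeds the video's mean, so
    videos with more high-attention snippets select more of them.
    This is an illustrative heuristic, not the paper's exact rule."""
    frac_active = float((attn > attn.mean()).mean())
    return float(np.clip(base * (0.5 + frac_active), 0.05, 0.5))
```

A fixed `ratio` applies the same selection to every video; replacing it with `dynamic_ratio(attn)` lets videos with long actions polarize more snippets than videos with brief ones, which is the flexibility the abstract argues a fixed threshold lacks.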