{"title":"Tackling confusion among actions for action segmentation with adaptive margin and energy-driven refinement","authors":"Zhichao Ma, Kan Li","doi":"10.1007/s00138-023-01505-z","DOIUrl":null,"url":null,"abstract":"<p>Video action segmentation is a crucial task in evaluating the ability to understand human activities. Previous works on this task mainly focus on capturing complex temporal structures and fail to consider the feature ambiguity among similar actions and the biased training sets, thus they are easy to confuse some actions. In this paper, we propose a novel action segmentation framework, called DeConfuNet, to solve the above issue. First, we design a discriminative enhancement module (DEM) trained by an adaptive margin-guided discriminative feature learning which adjusts the margin adaptively to increase the feature distinguishability among similar actions, and whose multi-stage reasoning and adaptive feature fusion structures provide structural advantages for distinguishing similar actions. Second, we propose an equalizing influence module (EIM) that can overcome the impact of biased training sets by balancing the influence of training samples under a coefficient-adaptive loss function. Third, an energy and context-driven refinement module (ECRM) further alleviates the impact of the unbalanced influence of training samples by fusing and refining the inference of DEM and EIM, which utilizes the phased prediction including context and energy clues to assimilate untrustworthy segments, alleviating over-segmentation hugely. 
Extensive experiments show the effectiveness of each proposed technique, they verify that the DEM and EIM are complementary in reasoning and cooperate to overcome the confusion issue, and our approach achieves significant improvement and state-of-the-art performance of accuracy, edit score, and F1 score on the challenging 50Salads, GTEA, and Breakfast benchmarks.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"1 1","pages":""},"PeriodicalIF":2.4000,"publicationDate":"2024-01-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Machine Vision and Applications","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s00138-023-01505-z","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0
Abstract
Video action segmentation is a crucial task for evaluating the ability to understand human activities. Previous works on this task mainly focus on capturing complex temporal structures and fail to consider the feature ambiguity among similar actions and the bias of training sets, so they easily confuse similar actions. In this paper, we propose a novel action segmentation framework, called DeConfuNet, to solve this issue. First, we design a discriminative enhancement module (DEM) trained with adaptive margin-guided discriminative feature learning, which adjusts the margin adaptively to increase feature distinguishability among similar actions; its multi-stage reasoning and adaptive feature fusion structures provide structural advantages for distinguishing similar actions. Second, we propose an equalizing influence module (EIM) that overcomes the impact of biased training sets by balancing the influence of training samples under a coefficient-adaptive loss function. Third, an energy- and context-driven refinement module (ECRM) further alleviates the impact of the unbalanced influence of training samples by fusing and refining the inferences of the DEM and EIM; it uses phased predictions, including context and energy clues, to assimilate untrustworthy segments, greatly alleviating over-segmentation. Extensive experiments demonstrate the effectiveness of each proposed technique, verify that the DEM and EIM are complementary in reasoning and cooperate to overcome the confusion issue, and show that our approach achieves significant improvement and state-of-the-art accuracy, edit score, and F1 score on the challenging 50Salads, GTEA, and Breakfast benchmarks.
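The abstract does not specify how the DEM's adaptive margin-guided feature learning is formulated. As a rough illustration of the general idea — enlarging the decision margin between visually similar action classes — here is a minimal, hypothetical sketch of an additive-margin softmax loss whose per-class margin grows with the cosine similarity between class prototypes. All names and the margin-scaling rule are assumptions for illustration, not the paper's actual method.

```python
import numpy as np

def adaptive_margin_loss(features, labels, prototypes,
                         base_margin=0.2, scale=10.0):
    """Cross-entropy with an additive margin on the true-class logit.

    The margin for each class grows with the cosine similarity between
    that class's prototype and its nearest other prototype, so similar
    action classes are pushed further apart in feature space.
    (Hypothetical sketch, not the paper's exact formulation.)
    """
    # L2-normalise features and class prototypes -> cosine logits
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    w = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    logits = f @ w.T                        # (batch, n_classes), in [-1, 1]

    # For each class, cosine similarity to its nearest other prototype
    proto_sim = w @ w.T
    np.fill_diagonal(proto_sim, -1.0)
    nearest_sim = proto_sim.max(axis=1)     # (n_classes,)

    # Margin adapts: classes with a close neighbour get a larger margin
    margins = base_margin * (1.0 + nearest_sim.clip(0.0, 1.0))

    # Subtract the margin from the true-class logit before softmax
    idx = np.arange(len(labels))
    logits[idx, labels] -= margins[labels]

    z = scale * logits
    z -= z.max(axis=1, keepdims=True)       # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[idx, labels].mean()
```

Because the margin is subtracted from the true-class logit, the loss with a positive margin is strictly larger than without one, which forces features of easily confused classes further from the shared decision boundary during training.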
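The ECRM is said to assimilate untrustworthy segments using context and energy clues. As a simplified illustration of this kind of over-segmentation refinement (not the paper's actual algorithm — the rule below for picking the absorbing neighbour is an assumption), a frame-wise label sequence can be smoothed by merging short, unreliable runs into the more confident neighbouring segment:

```python
def refine_segments(labels, confidence, min_len=3):
    """Merge runs shorter than `min_len` frames into the neighbouring
    segment with the higher mean confidence. Illustrative only."""
    # Split the frame-wise labels into [start, end, label) runs
    runs, start = [], 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            runs.append([start, i, labels[start]])
            start = i

    def mean_conf(run):
        s, e, _ = run
        return sum(confidence[s:e]) / (e - s)

    # Repeatedly absorb short runs into the stronger neighbour
    changed = True
    while changed:
        changed = False
        for k, (s, e, lab) in enumerate(runs):
            if e - s >= min_len or len(runs) == 1:
                continue
            left = runs[k - 1] if k > 0 else None
            right = runs[k + 1] if k + 1 < len(runs) else None
            target = max((r for r in (left, right) if r), key=mean_conf)
            runs[k][2] = target[2]          # relabel the short run
            # Merge now-adjacent runs that share a label
            merged = [runs[0]]
            for r in runs[1:]:
                if r[2] == merged[-1][2]:
                    merged[-1][1] = r[1]
                else:
                    merged.append(r)
            runs = merged
            changed = True
            break

    out = []
    for s, e, lab in runs:
        out.extend([lab] * (e - s))
    return out
```

For example, a one-frame spurious prediction inside a long confident segment is relabelled to match its surroundings, which is the sense in which such refinement reduces over-segmentation errors that inflate the edit and F1 metrics.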
Journal introduction:
Machine Vision and Applications publishes high-quality technical contributions in machine vision research and development. Specifically, the editors encourage submissions in all applications and engineering aspects of image-related computing. In particular, original contributions dealing with scientific, commercial, industrial, military, and biomedical applications of machine vision are all within the scope of the journal.
Particular emphasis is placed on engineering and technology aspects of image processing and computer vision.
The following aspects of machine vision applications are of interest: algorithms, architectures, VLSI implementations, AI techniques and expert systems for machine vision, front-end sensing, multidimensional and multisensor machine vision, real-time techniques, image databases, virtual reality and visualization. Papers must include a significant experimental validation component.