Achieving Procedure-Aware Instructional Video Correlation Learning Under Weak Supervision from a Collaborative Perspective

IF 11.6, Zone 2 (Computer Science), Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

International Journal of Computer Vision Pub Date : 2024-11-04 DOI:10.1007/s11263-024-02272-8

Tianyao He, Huabin Liu, Zelin Ni, Yuxi Li, Xiao Ma, Cheng Zhong, Yang Zhang, Yingxue Wang, Weiyao Lin


Citations: 0

Abstract

Video Correlation Learning (VCL) delineates a high-level research domain that centers on analyzing the semantic and temporal correspondences between videos through a comparative paradigm. Recently, instructional video-related tasks have drawn increasing attention due to their promising potential. Compared with general videos, instructional videos possess more complex procedure information, making correlation learning quite challenging. To obtain procedural knowledge, current methods rely heavily on fine-grained step-level annotations, which are costly and non-scalable. To improve VCL on instructional videos, we introduce a weakly supervised framework named Collaborative Procedure Alignment (CPA). To be specific, our framework comprises two core components: the collaborative step mining (CSM) module and the frame-to-step alignment (FSA) module. Free of the necessity for step-level annotations, the CSM module can properly conduct temporal step segmentation and pseudo-step learning by exploring the inner procedure correspondences between paired videos. Subsequently, the FSA module efficiently yields the probability of aligning one video’s frame-level features with another video’s pseudo-step labels, which can act as a reliable correlation degree for paired videos. The two modules are inherently interconnected and can mutually enhance each other to extract the step-level knowledge and measure the video correlation distances accurately. Our framework provides an effective tool for instructional video correlation learning. We instantiate our framework on four representative tasks, including sequence verification, few-shot action recognition, temporal action segmentation, and action quality assessment. Furthermore, we extend our framework to more innovative functions to further exhibit its potential. Extensive and in-depth experiments validate CPA’s strong correlation learning capability on instructional videos. 
The implementation can be found at https://github.com/hotelll/Collaborative_Procedure_Alignment.
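
As a rough illustration of the frame-to-step alignment idea described above, the following minimal sketch computes the log-probability that one video's frames align, in order, with another video's pseudo-steps. It is not the authors' implementation: the function names (`alignment_log_prob`, `log_softmax`, `logsumexp2`), the softmax over raw similarity scores, and the monotonic "stay or advance" alignment rule (similar in spirit to a CTC forward pass without blank symbols) are all assumptions made for illustration. The abstract does not specify how FSA computes this probability.

```python
import math

NEG_INF = float("-inf")


def logsumexp2(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    if a == NEG_INF:
        return b
    if b == NEG_INF:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def log_softmax(scores):
    """Turn one frame's raw step scores into log-probabilities over steps."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def alignment_log_prob(frame_step_scores):
    """Log-probability that the frames follow the steps in order.

    frame_step_scores[t][k] is a hypothetical similarity between frame t of
    video A and pseudo-step k of video B. Assumed alignment rule: the first
    frame belongs to step 0, the last frame to the final step, and each
    frame either stays in the current step or advances to the next one.
    """
    log_p = [log_softmax(row) for row in frame_step_scores]
    num_frames, num_steps = len(log_p), len(log_p[0])

    # alpha[k]: log-probability of all valid partial alignments that end
    # with the current frame assigned to step k.
    alpha = [NEG_INF] * num_steps
    alpha[0] = log_p[0][0]
    for t in range(1, num_frames):
        new_alpha = [NEG_INF] * num_steps
        for k in range(num_steps):
            stay = alpha[k]
            advance = alpha[k - 1] if k > 0 else NEG_INF
            prev = logsumexp2(stay, advance)
            if prev != NEG_INF:
                new_alpha[k] = prev + log_p[t][k]
        alpha = new_alpha
    # Stays -inf when there are fewer frames than steps.
    return alpha[-1]


# Toy scores: three frames, two pseudo-steps.
same_order = [[5.0, 0.0], [5.0, 0.0], [0.0, 5.0]]
reversed_order = [[0.0, 5.0], [0.0, 5.0], [5.0, 0.0]]
print(alignment_log_prob(same_order) > alignment_log_prob(reversed_order))
```

Under these assumptions, a pair of videos whose procedures unfold in the same order receives a higher alignment probability than a pair whose steps are swapped, which is the kind of signal a correlation score for paired videos needs.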




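Source journal

International Journal of Computer Vision (Engineering and Technology; Computer Science: Artificial Intelligence)

CiteScore: 29.80

Self-citation rate: 2.10%

Articles published: 163

Review time: 6 months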

Journal description: The International Journal of Computer Vision (IJCV) publishes high-quality, original contributions to the science and engineering of computer vision, in 12 issues per year. It accepts several article types. Regular articles (up to 25 journal pages) report significant technical advances of broad interest to the field. Short articles (up to 10 pages) offer a faster publication path for novel results. Survey articles (up to 30 pages) critically evaluate the current state of the art or give tutorial presentations of relevant topics. The journal also publishes book reviews, position papers, and editorials by prominent scientific figures. Authors are encouraged to provide supplementary material online, such as images, video sequences, data sets, and software, to aid understanding and reproducibility of the published research.