Tiny-VPS: Tiny Video Panoptic Segmentation Standing on the Shoulder of Giant-VPS

Impact Factor: 2.7; JCR Q2 (Engineering, Electrical & Electronic)

IEEE Open Journal of Signal Processing. Pub Date: 2025-06-20. DOI: 10.1109/OJSP.2025.3581840

Qingfeng Liu;Mostafa El-Khamy;Kee-Bong Song

Volume 6, pages 803-814. Journal Article. Full text: https://ieeexplore.ieee.org/document/11045393/


Abstract

Video Panoptic Segmentation (VPS) is the most challenging video segmentation task because it requires accurately labeling every pixel in each frame while also identifying individual instances and tracking them across frames. In this paper, we explore state-of-the-art solutions for VPS in both the giant-model regime, for offline or server processing, and the tiny-model regime, for online or edge computing. We designed Giant-VPS, which won first place in the 2024 Pixel Level Video Understanding in the Wild (PVUW) challenge. Giant-VPS builds on MinVIS and uses the DINOv2-giant vision foundation model with a carefully designed ViT (Vision Transformer) adapter. For mobile and edge devices, we designed the Tiny-VPS model and show that our novel ViT-adapter distillation from Giant-VPS further improves its accuracy. Tiny-VPS is the first model in the sub-20 GFLOPS regime to achieve competitive accuracy on VPS and VSS (Video Semantic Segmentation) benchmarks.
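
The abstract does not say which loss the ViT-adapter distillation uses. As a rough illustration of the general idea only, the sketch below shows one common form of feature distillation: student features are linearly projected to the teacher's feature dimension and compared with a mean squared error. This is an assumption, not the authors' method, and every name in it (`project`, `feature_distillation_loss`, `weight`) is hypothetical.

```python
# Minimal, generic feature-distillation sketch (NOT the paper's actual loss).
# Assumption: a small "student" produces per-token feature vectors that are
# linearly projected to the "teacher" dimension and matched with an MSE loss.

def project(features, weight):
    """Map each student vector (length d_s) to length d_t.

    `weight` is a d_s x d_t matrix stored as a list of rows.
    """
    return [
        [sum(f * w for f, w in zip(vec, col)) for col in zip(*weight)]
        for vec in features
    ]


def feature_distillation_loss(student_feats, teacher_feats, weight):
    """Mean squared error between projected student and teacher features."""
    projected = project(student_feats, weight)
    total, count = 0.0, 0
    for p_vec, t_vec in zip(projected, teacher_feats):
        for p, t in zip(p_vec, t_vec):
            total += (p - t) ** 2
            count += 1
    return total / count


# Toy data: 2 tokens, student dimension 2, teacher dimension 3.
student = [[1.0, 2.0], [0.0, 1.0]]
weight = [[1.0, 0.0, 1.0],
          [0.0, 1.0, 1.0]]
teacher = [[1.0, 2.0, 3.0], [0.0, 1.0, 3.0]]

loss = feature_distillation_loss(student, teacher, weight)
print(loss)
```

In a real training loop the projection weights would be learned jointly with the student, and the distillation term would typically be added to the segmentation loss; how Tiny-VPS combines them is not stated in the abstract.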


Source journal