高效视频理解的压缩视觉

Computer vision - ACCV ... : ... Asian Conference on Computer Vision : proceedings. Asian Conference on Computer Vision Pub Date : 2022-10-06 DOI:10.48550/arXiv.2210.02995

Olivia Wiles, J. Carreira, Iain Barr, Andrew Zisserman, Mateusz Malinowski

{"title":"高效视频理解的压缩视觉","authors":"Olivia Wiles, J. Carreira, Iain Barr, Andrew Zisserman, Mateusz Malinowski","doi":"10.48550/arXiv.2210.02995","DOIUrl":null,"url":null,"abstract":"Experience and reasoning occur across multiple temporal scales: milliseconds, seconds, hours or days. The vast majority of computer vision research, however, still focuses on individual images or short videos lasting only a few seconds. This is because handling longer videos require more scalable approaches even to process them. In this work, we propose a framework enabling research on hour-long videos with the same hardware that can now process second-long videos. We replace standard video compression, e.g. JPEG, with neural compression and show that we can directly feed compressed videos as inputs to regular video networks. Operating on compressed videos improves efficiency at all pipeline levels -- data transfer, speed and memory -- making it possible to train models faster and on much longer videos. Processing compressed signals has, however, the downside of precluding standard augmentation techniques if done naively. We address that by introducing a small network that can apply transformations to latent codes corresponding to commonly used augmentations in the original video space. We demonstrate that with our compressed vision pipeline, we can train video models more efficiently on popular benchmarks such as Kinetics600 and COIN. We also perform proof-of-concept experiments with new tasks defined over hour-long videos at standard frame rates. Processing such long videos is impossible without using compressed representation.","PeriodicalId":87238,"journal":{"name":"Computer vision - ACCV ... : ... Asian Conference on Computer Vision : proceedings. Asian Conference on Computer Vision","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2022-10-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":"{\"title\":\"Compressed Vision for Efficient Video Understanding\",\"authors\":\"Olivia Wiles, J. Carreira, Iain Barr, Andrew Zisserman, Mateusz Malinowski\",\"doi\":\"10.48550/arXiv.2210.02995\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Experience and reasoning occur across multiple temporal scales: milliseconds, seconds, hours or days. The vast majority of computer vision research, however, still focuses on individual images or short videos lasting only a few seconds. This is because handling longer videos require more scalable approaches even to process them. In this work, we propose a framework enabling research on hour-long videos with the same hardware that can now process second-long videos. We replace standard video compression, e.g. JPEG, with neural compression and show that we can directly feed compressed videos as inputs to regular video networks. Operating on compressed videos improves efficiency at all pipeline levels -- data transfer, speed and memory -- making it possible to train models faster and on much longer videos. Processing compressed signals has, however, the downside of precluding standard augmentation techniques if done naively. We address that by introducing a small network that can apply transformations to latent codes corresponding to commonly used augmentations in the original video space. We demonstrate that with our compressed vision pipeline, we can train video models more efficiently on popular benchmarks such as Kinetics600 and COIN. We also perform proof-of-concept experiments with new tasks defined over hour-long videos at standard frame rates. Processing such long videos is impossible without using compressed representation.\",\"PeriodicalId\":87238,\"journal\":{\"name\":\"Computer vision - ACCV ... : ... Asian Conference on Computer Vision : proceedings. Asian Conference on Computer Vision\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-10-06\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"4\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Computer vision - ACCV ... : ... Asian Conference on Computer Vision : proceedings. Asian Conference on Computer Vision\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.48550/arXiv.2210.02995\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer vision - ACCV ... : ... Asian Conference on Computer Vision : proceedings. Asian Conference on Computer Vision","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.48550/arXiv.2210.02995","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 4

摘要

经验和推理发生在多个时间尺度上:毫秒、秒、小时或天。然而，绝大多数计算机视觉研究仍然集中在只有几秒钟的单个图像或短视频上。这是因为处理较长的视频甚至需要更多的可伸缩方法来处理它们。在这项工作中，我们提出了一个框架，可以使用相同的硬件来研究一小时长的视频，现在可以处理第二小时长的视频。我们用神经压缩取代了标准的视频压缩，例如JPEG，并表明我们可以直接将压缩的视频作为常规视频网络的输入。在压缩视频上操作可以提高所有管道级别的效率-数据传输，速度和内存-使得可以在更长的视频上更快地训练模型。然而，处理压缩信号的缺点是，如果单纯地进行处理，就会排除标准增强技术。我们通过引入一个小型网络来解决这个问题，该网络可以将转换应用于与原始视频空间中常用增强相对应的潜在代码。我们证明，通过我们的压缩视觉管道，我们可以在Kinetics600和COIN等流行基准上更有效地训练视频模型。我们还在标准帧率下对长达一小时的视频定义新任务进行概念验证实验。如果不使用压缩表示，处理如此长的视频是不可能的。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Compressed Vision for Efficient Video Understanding

Experience and reasoning occur across multiple temporal scales: milliseconds, seconds, hours or days. The vast majority of computer vision research, however, still focuses on individual images or short videos lasting only a few seconds. This is because handling longer videos require more scalable approaches even to process them. In this work, we propose a framework enabling research on hour-long videos with the same hardware that can now process second-long videos. We replace standard video compression, e.g. JPEG, with neural compression and show that we can directly feed compressed videos as inputs to regular video networks. Operating on compressed videos improves efficiency at all pipeline levels -- data transfer, speed and memory -- making it possible to train models faster and on much longer videos. Processing compressed signals has, however, the downside of precluding standard augmentation techniques if done naively. We address that by introducing a small network that can apply transformations to latent codes corresponding to commonly used augmentations in the original video space. We demonstrate that with our compressed vision pipeline, we can train video models more efficiently on popular benchmarks such as Kinetics600 and COIN. We also perform proof-of-concept experiments with new tasks defined over hour-long videos at standard frame rates. Processing such long videos is impossible without using compressed representation.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Computer vision - ACCV ... : ... Asian Conference on Computer Vision : proceedings. Asian Conference on Computer Vision

自引率

0.00%

发文量