Large-Scale Training Framework for Video Annotation

Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. Pub Date: 2019-07-25. DOI: 10.1145/3292500.3330653

Seong Jae Hwang, Joonseok Lee, Balakrishnan Varadarajan, A. Gordon, Zheng Xu, A. Natsev


Citations: 2

Abstract

Video is one of the richest sources of information available online, but extracting deep insights from video content at internet scale remains an open problem in terms of depth of understanding, breadth of understanding, and scale. Over the last few years, the field of video understanding has made great strides due to the availability of large-scale video datasets and core advances in image, audio, and video modeling architectures. However, architectures that are state of the art on small-scale datasets are frequently impractical at internet scale, both to train on hundreds of millions of videos and to deploy for inference on billions of videos. In this paper, we present a MapReduce-based training framework that exploits both data parallelism and model parallelism to scale the training of complex video models. The proposed framework uses alternating optimization and full-batch fine-tuning, and supports large Mixture-of-Experts classifiers with hundreds of thousands of mixtures. This enables a trade-off between model depth and breadth, and makes it possible to shift model capacity between shared (generalization) layers and per-class (specialization) layers. We demonstrate that the proposed framework reaches state-of-the-art performance on the largest public video datasets, YouTube-8M and Sports-1M, and can scale to datasets 100 times larger.
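To make two of the ideas in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes a common form of per-class Mixture-of-Experts classifier (a softmax gate over logistic experts) and imitates a MapReduce-style full-batch pass by computing partial gradients per data shard and summing them. The abstract does not specify the gating form, the loss, or how shards are handled, so all function names (moe_predict, map_shard, reduce_pair, full_batch_gradient) and the choice to differentiate only the expert weights are assumptions made for illustration.

```python
import numpy as np
from functools import reduce


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def moe_predict(x, gate_w, expert_w):
    """P(label = 1 | x) for a single class (assumed MoE form).

    x: (n, d) features; gate_w and expert_w: (d, k) weights for k experts.
    """
    gates = softmax(x @ gate_w)      # (n, k), each row sums to 1
    experts = sigmoid(x @ expert_w)  # (n, k), one logistic expert per column
    return (gates * experts).sum(axis=1)


def map_shard(shard, gate_w, expert_w):
    """Map step: summed log-loss and its gradient w.r.t. expert_w on one shard."""
    x, y = shard
    gates = softmax(x @ gate_w)
    experts = sigmoid(x @ expert_w)
    p = np.clip((gates * experts).sum(axis=1), 1e-7, 1 - 1e-7)
    loss = -(y * np.log(p) + (1 - y) * np.log(1 - p)).sum()
    dloss_dp = (p - y) / (p * (1 - p))            # (n,)
    dp_dlogit = gates * experts * (1 - experts)   # (n, k)
    grad = x.T @ (dloss_dp[:, None] * dp_dlogit)  # (d, k)
    return loss, grad, len(y)


def reduce_pair(a, b):
    """Reduce step: element-wise sum of two (loss, grad, count) partial results."""
    return a[0] + b[0], a[1] + b[1], a[2] + b[2]


def full_batch_gradient(shards, gate_w, expert_w):
    """Average loss and expert-weight gradient over all shards (one full-batch pass)."""
    partials = (map_shard(s, gate_w, expert_w) for s in shards)
    loss, grad, n = reduce(reduce_pair, partials)
    return loss / n, grad / n
```

Because the map step returns sums rather than averages, the reduced result does not depend on how the examples are split across shards. That property is what lets a full-batch gradient be computed in a data-parallel way in this sketch; in a real MapReduce job the generator over shards would be replaced by distributed workers.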

