Attr4Vis: Revisiting Importance of Attribute Classification in Vision-Language Models for Video Recognition

Alexander Zarichkovyi, Inna V. Stetsenko
International Journal of Computing · Q3, Computer Science
DOI: 10.47839/ijc.23.1.3440
Published: 2024-04-01 (Journal Article)
Citations: 0

Abstract

Vision-language models (VLMs), pretrained on expansive datasets containing image-text pairs, have exhibited remarkable transferability across a diverse spectrum of visual tasks. Leveraging the knowledge encoded within these potent VLMs holds significant promise for the advancement of effective video recognition models. A fundamental aspect of pretrained VLMs lies in their ability to establish a crucial bridge between the visual and textual domains. In our pioneering work, we introduce the Attr4Vis framework, dedicated to exploring knowledge transfer between Video and Text modalities to bolster video recognition performance. Central to our contributions is a comprehensive revisitation of Text-to-Video classifier initialization, a critical step that refines the initialization process and streamlines the integration of our framework, particularly within existing Vision-Language Models (VLMs). Furthermore, we emphasize the adoption of dense attribute generation techniques, shedding light on their paramount importance in video analysis. By effectively encoding attribute changes over time, these techniques significantly enhance event representation and recognition within videos. In addition, we introduce an innovative Attribute Enrichment Algorithm aimed at enriching the set of attributes using large language models (LLMs) such as ChatGPT. Through the seamless integration of these components, Attr4Vis attains a state-of-the-art accuracy of 91.5% on the challenging Kinetics-400 dataset using the InternVideo model.
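The abstract describes initializing a video classifier from the text side of a VLM but gives no implementation details. The following is a minimal illustrative sketch of the general CLIP-style idea it builds on: the classifier weights are the normalized text embeddings of class (or attribute) prompts, and recognition reduces to cosine similarity against a pooled video embedding. All function names and the `text_encoder` interface here are assumptions for illustration, not the paper's actual code.

```python
import numpy as np

def normalize(x, axis=-1):
    # L2-normalize embeddings so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def init_classifier_from_text(class_prompts, text_encoder):
    # Build a linear classifier whose rows are the normalized text
    # embeddings of each class prompt (zero-shot initialization).
    return np.stack([normalize(text_encoder(p)) for p in class_prompts])

def classify_video(video_embedding, weights):
    # Score = cosine similarity between the pooled video embedding
    # and each class-prompt embedding; predict the best-scoring class.
    scores = weights @ normalize(video_embedding)
    return int(np.argmax(scores))
```

In this view, the attribute-enrichment step of Attr4Vis would amount to expanding `class_prompts` with additional LLM-generated attribute phrases per class before building the weight matrix.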
Source journal: International Journal of Computing — Computer Science (miscellaneous)
CiteScore: 2.20
Self-citation rate: 0.00%
Articles per year: 39
Journal description: The International Journal of Computing was established in 2002 on the basis of the Branch Research Laboratory for Automated Systems and Networks, renamed the Research Institute of Intelligent Computer Systems in 2005. The Journal's goal is to publish papers presenting novel results in Computing Science, Computer Engineering, Information Technologies, Software Engineering, and Information Systems within the Journal's topics. The official language of the Journal is English; paper abstracts are also published in Ukrainian and Russian. Issues of the Journal are published quarterly. The Editorial Board consists of about 30 internationally recognized scientists.