{"title":"Attr4Vis: Revisiting Importance of Attribute Classification in Vision-Language Models for Video Recognition","authors":"Alexander Zarichkovyi, Inna V. Stetsenko","doi":"10.47839/ijc.23.1.3440","DOIUrl":null,"url":null,"abstract":"Vision-language models (VLMs), pretrained on expansive datasets containing image-text pairs, have exhibited remarkable transferability across a diverse spectrum of visual tasks. The leveraging of knowledge encoded within these potent VLMs holds significant promise for the advancement of effective video recognition models. A fundamental aspect of pretrained VLMs lies in their ability to establish a crucial bridge between the visual and textual domains. In our pioneering work, we introduce the Attr4Vis framework, dedicated to exploring knowledge transfer between Video and Text modalities to bolster video recognition performance. Central to our contributions is the comprehensive revisitation of Text-to-Video classifier initialization, a critical step that refines the initialization process and streamlines the integration of our framework, particularly within existing Vision-Language Models (VLMs). Furthermore, we emphasize the adoption of dense attribute generation techniques, shedding light on their paramount importance in video analysis. By effectively encoding attribute changes over time, these techniques significantly enhance event representation and recognition within videos. In addition, we introduce an innovative Attribute Enrichment Algorithm aimed at enriching set of attributes by large language models (LLMs) like ChatGPT. Through the seamless integration of these components, Attr4Vis attains a state-of-the-art accuracy of 91.5% on the challenging Kinetics-400 dataset using the InternVideo model.","PeriodicalId":37669,"journal":{"name":"International Journal of Computing","volume":"140 ","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Journal of Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.47839/ijc.23.1.3440","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"Computer Science","Score":null,"Total":0}
Abstract
Vision-language models (VLMs), pretrained on expansive datasets of image-text pairs, have exhibited remarkable transferability across a diverse spectrum of visual tasks. Leveraging the knowledge encoded within these potent VLMs holds significant promise for building effective video recognition models. A fundamental strength of pretrained VLMs lies in their ability to establish a bridge between the visual and textual domains. In this pioneering work, we introduce the Attr4Vis framework, dedicated to exploring knowledge transfer between the video and text modalities to bolster video recognition performance. Central to our contributions is a comprehensive revisiting of Text-to-Video classifier initialization, a critical step that refines the initialization process and streamlines the integration of our framework with existing VLMs. Furthermore, we emphasize the adoption of dense attribute generation techniques and highlight their importance in video analysis: by encoding attribute changes over time, these techniques significantly enhance event representation and recognition within videos. In addition, we introduce an Attribute Enrichment Algorithm that enriches the set of attributes using large language models (LLMs) such as ChatGPT. Through the seamless integration of these components, Attr4Vis attains a state-of-the-art accuracy of 91.5% on the challenging Kinetics-400 dataset using the InternVideo model.
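To make the classifier-initialization idea concrete, below is a minimal, illustrative sketch of how a Text-to-Video classification head can be initialized from a CLIP-style text encoder: class-name prompts are embedded, and the normalized text embeddings serve as the initial classifier weights. This sketch shows only the general idea, not the exact Attr4Vis or InternVideo procedure; the checkpoint name, prompt template, and class names are assumptions chosen for illustration.

```python
# Sketch: initializing a video classification head from text embeddings
# (general CLIP-style idea; not the authors' exact implementation).
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")       # assumed checkpoint
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

class_names = ["archery", "baking cookies", "playing drums"]            # e.g. Kinetics-400 labels
prompts = [f"a video of {name}" for name in class_names]                # assumed prompt template

with torch.no_grad():
    inputs = processor(text=prompts, return_tensors="pt", padding=True)
    text_embeds = model.get_text_features(**inputs)                     # (num_classes, dim)
    text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)  # L2-normalize

# Use the text embeddings as the initial weight matrix of the classifier head.
classifier = torch.nn.Linear(text_embeds.shape[1], len(class_names), bias=False)
classifier.weight.data.copy_(text_embeds)
```

At inference time, video features projected into the same embedding space are scored against these weights (effectively a cosine similarity), which is why a text-derived initialization carries VLM knowledge directly into the video classifier.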
Journal description:
The International Journal of Computing was established in 2002 on the basis of the Branch Research Laboratory for Automated Systems and Networks, which was renamed the Research Institute of Intelligent Computer Systems in 2005. The goal of the Journal is to publish papers presenting novel results in Computer Science, Computer Engineering, Information Technologies, Software Engineering, and Information Systems within the Journal's topics. The official language of the Journal is English; paper abstracts are also published in Ukrainian and Russian. The Journal is issued quarterly. The Editorial Board consists of about 30 internationally recognized scientists.