Contextual video clip classification
S. Guler, Ashutosh Morde, Ian A. Pushee, Xiang Ma, Jason A. Silverstein, S. McAuliffe
2012 IEEE Applied Imagery Pattern Recognition Workshop (AIPR), 2012. DOI: 10.1109/AIPR.2012.6528196
Abstract
Content-based classification of unrestricted video clips from diverse sources plays an important role in video analysis and search. Thus far, automated video understanding research has focused on videos from constrained sources such as aerial, broadcast, and meeting-room footage. For each of these sources, certain assumptions are made that constrain the content-analysis problem; none of these assumptions holds when analyzing unrestricted videos. We present a top-down approach to content-based video classification that first understands the overall scene structure and then detects the actors, actions, and objects, along with the context in which they interact and the global motion information in the scene. A scene in a video clip serves as a semantic unit that provides the visual context and the location characteristics associated with the scene, such as whether the setting is indoor or outdoor and the specific type of each. The location context is tied to the zoom-in/zoom-out shooting style of the video to create a scene description hierarchy. Detected people and faces act as the actors, certain human poses help define actions and activities, and objects relevant to particular event types provide additional context. Summary features are computed for each scene semantic unit from the detected actors, actions, objects, and their context. These features were used to train an asymmetric Random Forest classifier for video event classification. The top-down approach presented here has the inherent advantage of being able to describe the video in addition to classifying its content. The approach was tested on the Multimedia Event Detection (MED) 2011 dataset with promising results.
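As a rough illustration of the final classification stage, the sketch below assembles per-scene summary features and trains a Random Forest on them with scikit-learn. The feature layout, the hypothetical scene_summary_feature helper, the placeholder data, and the use of class_weight="balanced" as a stand-in for the paper's asymmetric Random Forest are all assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of the event-classification stage described in the
# abstract. Feature encoding and classifier settings are assumptions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def scene_summary_feature(people, faces, poses, objects, location, motion):
    """Concatenate per-scene detection summaries into one feature vector.

    Each argument is assumed to be a fixed-length histogram or score
    vector aggregated over the scene semantic unit (hypothetical encoding).
    """
    return np.concatenate([people, faces, poses, objects, location, motion])

# X: one summary feature vector per scene semantic unit; y: event labels.
# Random placeholders stand in for real detector outputs.
rng = np.random.default_rng(0)
X = rng.random((200, 64))          # placeholder summary features
y = rng.integers(0, 5, size=200)   # placeholder event classes

# class_weight="balanced" penalizes errors on rare event classes more
# heavily -- one plausible reading of an "asymmetric" Random Forest,
# not necessarily what the paper implemented.
clf = RandomForestClassifier(n_estimators=200,
                             class_weight="balanced",
                             random_state=0)
clf.fit(X, y)
print(clf.predict(X[:3]))
```

In practice the feature vectors would come from the detectors the abstract lists (people, faces, poses, event-relevant objects, location context, and global motion), aggregated once per scene before training.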