Multi-modal fusion for video understanding
A. Hoogs, J. Mundy, G. Cross
Proceedings 30th Applied Imagery Pattern Recognition Workshop (AIPR 2001): Analysis and Understanding of Time Varying Imagery
Published 2001-10-10 · DOI: 10.1109/AIPR.2001.991210 · Citations: 11
The exploitation of semantic information in computer vision problems can be difficult because of the large difference in representations and levels of knowledge. Image analysis is formulated in terms of low-level features describing image structure and intensity, while high-level knowledge such as purpose and common sense is encoded in abstract, non-geometric representations. In this work we attempt to bridge this gap by integrating image analysis algorithms with WordNet, a large semantic network that explicitly links related words in a hierarchical structure. Our problem domain is the understanding of broadcast news, as it provides both linguistic information (the transcript) and visual information (the video). Visual detection algorithms such as face detection and object tracking are applied to the video to extract basic object information, which is indexed into WordNet. The transcript provides topic information in the form of detected keywords. Together, both types of information are used to constrain a search within WordNet for a description of the video content in terms of the most likely WordNet concepts. This project is in its early stages; the general ideas and concepts are presented here.
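The fusion step described above — indexing visual detections and transcript keywords into a concept hierarchy, then searching for the concept that best explains both — can be sketched as follows. This is not the paper's implementation: the tiny hand-built `HYPERNYM` table stands in for WordNet's hypernym ("is-a") links, and the scoring (most specific common subsumer of a visual label and a keyword) is a deliberately crude stand-in for the paper's constrained WordNet search; all words in the toy hierarchy are illustrative.

```python
# Toy sketch of multi-modal fusion over a concept hierarchy.
# HYPERNYM is a hand-built stand-in for WordNet's hypernym links:
# each entry maps a concept to its more general parent concept.
HYPERNYM = {
    "face": "person",
    "anchor": "person",
    "person": "entity",
    "election": "event",
    "interview": "event",
    "event": "entity",
}

def ancestors(word):
    """Return the word followed by all of its hypernym ancestors, most
    specific first (e.g. face -> person -> entity)."""
    chain = []
    while word is not None:
        chain.append(word)
        word = HYPERNYM.get(word)
    return chain

def lowest_common_hypernym(a, b):
    """The most specific concept that subsumes both words, or None if
    they share no ancestor in the hierarchy."""
    anc_a = set(ancestors(a))
    for concept in ancestors(b):  # walk upward from b; first hit is lowest
        if concept in anc_a:
            return concept
    return None

def fused_concept(visual_labels, keywords):
    """Most specific concept subsuming at least one visual detection and
    at least one transcript keyword -- a crude proxy for constraining
    the WordNet search with both modalities at once."""
    candidates = [lowest_common_hypernym(v, k)
                  for v in visual_labels for k in keywords]
    candidates = [c for c in candidates if c is not None]
    # longer ancestor chain == deeper in the hierarchy == more specific
    return max(candidates, key=lambda c: len(ancestors(c)))
```

For example, a detected face in the video combined with the keywords "anchor" and "election" from the transcript yields `fused_concept(["face"], ["anchor", "election"])` → `"person"`: the transcript narrows the visual evidence to the most specific concept both modalities support, rather than the uninformative root `"entity"`.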