Chenyu Cao, C. Yan, Fangtao Li, Zihe Liu, Z. Wang, Bin Wu
2021 IEEE International Conference on Big Knowledge (ICBK), published 2021-12-01
DOI: https://doi.org/10.1109/ICKG52313.2021.00032
Citations: 1
Recognizing Characters and Relationships from Videos via Spatial-Temporal and Multimodal Cues
Video contains rich multimodal semantic knowledge about the people who appear in it. Mining deep or latent semantic knowledge from video could help artificial intelligence better understand human behavior and emotion. Research on deep, contextual semantic knowledge in video is currently scarce. Many studies on mining knowledge about characters and the visual relationships between people remain limited to static images and pay little attention to temporal visual features and other important modalities. To better mine the semantic knowledge in video, we propose a novel Global-local VLAD (GL-VLAD) module, which uses convolutions of different scales to obtain different receptive fields and extract both global and local feature information from the video. In addition, we propose a Multimodal Fusion Graph (MFG) that focuses on the knowledge of different modalities and can represent the general features of multimodal video scenes. We conduct extensive experiments on social relation extraction and person recognition on the MovieGraphs and IQIYI-VID-2019 datasets. On IQIYI-VID-2019, accuracy and mAP reach 90.23% and 89.87%, respectively. On the fine-grained MovieGraphs dataset, accuracy reaches 56.13% for the relation extraction task, while person recognition reaches 89.31% accuracy and 85.24% mAP. The experimental results show that our proposed method outperforms state-of-the-art methods.
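The abstract gives no implementation details for GL-VLAD, so the following is only a rough sketch of the general idea of pairing several temporal receptive fields with VLAD-style aggregation. It is not the authors' module. It assumes a NetVLAD-like soft assignment to cluster centers, and it uses uniform averaging windows as a stand-in for learned convolutions. All function names, the kernel sizes, and the toy inputs are hypothetical.

```python
import math

def temporal_conv(frames, kernel_size):
    """Uniform averaging over a temporal window (a stand-in for a learned 1-D convolution)."""
    half = kernel_size // 2
    out = []
    for t in range(len(frames)):
        window = frames[max(0, t - half): t + half + 1]  # window is clipped at the sequence edges
        out.append([sum(col) / len(window) for col in zip(*window)])
    return out

def soft_assign(x, centers, alpha=1.0):
    """Softmax over negative squared distances from x to each cluster center."""
    logits = [-alpha * sum((xi - ci) ** 2 for xi, ci in zip(x, c)) for c in centers]
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def vlad(frames, centers, alpha=1.0):
    """Sum soft-weighted residuals per center, flatten to K*D values, then L2-normalise."""
    k, d = len(centers), len(centers[0])
    desc = [[0.0] * d for _ in range(k)]
    for x in frames:
        weights = soft_assign(x, centers, alpha)
        for j in range(k):
            for i in range(d):
                desc[j][i] += weights[j] * (x[i] - centers[j][i])
    flat = [v for row in desc for v in row]
    norm = math.sqrt(sum(v * v for v in flat)) or 1.0
    return [v / norm for v in flat]

def multi_scale_vlad(frames, centers, kernel_sizes=(1, 3)):
    """Concatenate one VLAD descriptor per temporal scale (small window = local, large = global)."""
    out = []
    for ks in kernel_sizes:
        out.extend(vlad(temporal_conv(frames, ks), centers))
    return out

# Toy input: 4 frames with 2-D features and 2 cluster centers (all values made up).
frames = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]]
centers = [[0.0, 0.0], [1.0, 1.0]]
descriptor = multi_scale_vlad(frames, centers)
print(len(descriptor))  # 2 scales * 2 centers * 2 dims = 8
```

In this sketch a kernel size of 1 leaves each frame untouched (the local view), while larger kernels mix neighbouring frames before aggregation (a more global view). In a trained model the cluster centers and convolution weights would be learned rather than fixed as they are here.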