Weight-based multi-stream model for Multi-Modal Video Question Answering

M. Rajesh, Sanjiv Sridhar, C. Kulkarni, Aaditya Shah, N. S
Published in: The International FLAIRS Conference Proceedings
Publication date: 2023-05-08
DOI: 10.32473/flairs.36.133306 (https://doi.org/10.32473/flairs.36.133306)
Citations: 0

Abstract

There has been tremendous success in the individual domains of Computer Vision, Natural Language Processing, and Knowledge Representation. Videos are a rich source of information that blends the multi-modal data forms of images, audio, and, optionally, subtitles. Current research combines these individual domains, which has given rise to topics such as image captioning, visual question answering, and video question answering. Video Question Answering is a task that combines research topics such as object detection and recognition, temporal information processing, visual attention, and natural language processing. In this paper, we propose a model with an attention mechanism for Video Question Answering that assigns varying weights to the many pieces of information a video contains. The model combines the question with three streams (the video's frames, subtitles, and objects) to obtain the most probable answer. Because it is trained and tested on the TVQA dataset, the model also receives the set of answer candidates as input and predicts one of them as the most probable answer.
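The abstract does not give the architecture, so the following is only a minimal sketch of the general idea of weighting several streams and choosing among answer candidates. It assumes each stream (frames, subtitles, objects) has already produced one score per candidate, and that a softmax over hypothetical per-stream importance values yields the weights. The function names `softmax` and `fuse_streams`, the dictionary layout, and all numbers are illustrative assumptions, not the authors' implementation.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_streams(stream_scores, stream_logits):
    """Combine per-stream candidate scores using softmax-normalised stream weights.

    stream_scores: dict of stream name -> list of scores, one per answer candidate.
    stream_logits: dict of stream name -> unnormalised importance value.
                   (Hypothetical: a trained model would compute these with an
                   attention layer conditioned on the question.)
    Returns (index of the most probable candidate, probabilities over candidates).
    """
    names = sorted(stream_scores)
    weights = softmax([stream_logits[n] for n in names])
    num_candidates = len(stream_scores[names[0]])
    # Weighted sum of the stream scores for each candidate answer.
    fused = [
        sum(w * stream_scores[n][i] for w, n in zip(weights, names))
        for i in range(num_candidates)
    ]
    probs = softmax(fused)
    best = max(range(num_candidates), key=probs.__getitem__)
    return best, probs


# Toy, made-up scores for five answer candidates; not taken from the paper.
scores = {
    "frames":    [0.1, 0.4, 0.2, 0.1, 0.2],
    "subtitles": [0.2, 1.5, 0.3, 0.1, 0.4],
    "objects":   [0.3, 0.2, 0.9, 0.1, 0.1],
}
logits = {"frames": 0.0, "subtitles": 1.0, "objects": 0.5}
best, probs = fuse_streams(scores, logits)
print(best, [round(p, 3) for p in probs])
```

In this toy setup the subtitle stream receives the largest weight, so the candidate it favours dominates the fused score. A learned model would presumably derive both the per-stream scores and the weights from the question and video features rather than from fixed numbers.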