Weight-based multi-stream model for Multi-Modal Video Question Answering

The International FLAIRS Conference Proceedings. Pub Date: 2023-05-08. DOI: 10.32473/flairs.36.133306

M. Rajesh, Sanjiv Sridhar, C. Kulkarni, Aaditya Shah, N. S


Citations: 0

Abstract

The individual fields of Computer Vision, Natural Language Processing, and Knowledge Representation have each seen tremendous success. Videos are a rich source of information that blends several modalities: images, audio, and optionally subtitles. Current research combines these fields, giving rise to topics such as image captioning, visual question answering, and video question answering. Video Question Answering is a task that draws on object detection and recognition, temporal information processing, visual attention, and natural language processing. In this paper, we propose a Video Question Answering model with an attention mechanism that assigns varying weights to the many pieces of information a video contains. The model combines the question with three streams (the video's frames, subtitles, and objects) to find the most probable answer. Because it is trained and tested on the TVQA dataset, the model also receives the set of answer candidates as input and predicts one of them as the most probable answer.
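The abstract does not give the architecture in detail, so the following is only a minimal sketch of the general idea it describes: a question attends over three streams (frames, subtitles, objects), the streams are combined with weights, and each answer candidate is scored. The function names, the dot-product attention, the per-stream weight logits, and the toy two-dimensional vectors are all assumptions made for illustration, not the authors' method.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def attend(query, items):
    """Pool `items` into one vector, weighting each by softmax(query . item)."""
    weights = softmax([dot(query, item) for item in items])
    return [sum(w * item[d] for w, item in zip(weights, items))
            for d in range(len(query))]

def predict(question, streams, stream_logits, candidates):
    """Score every answer candidate against a weighted mix of stream summaries.

    Hypothetical interface: `streams` maps a stream name to a list of feature
    vectors, `stream_logits` maps the same names to unnormalised weights.
    Returns (index of the most probable candidate, candidate probabilities).
    """
    names = sorted(streams)
    stream_weights = softmax([stream_logits[n] for n in names])
    # One question-conditioned summary vector per stream.
    summaries = [attend(question, streams[n]) for n in names]
    scores = [
        sum(w * dot(summary, cand) for w, summary in zip(stream_weights, summaries))
        for cand in candidates
    ]
    probs = softmax(scores)
    best = max(range(len(probs)), key=lambda i: probs[i])
    return best, probs

# Toy inputs (made up): 2-d embeddings standing in for learned features.
question = [1.0, 0.0]
streams = {
    "frames": [[1.0, 0.0], [0.0, 1.0]],
    "subtitles": [[2.0, 0.0]],
    "objects": [[0.0, 0.0]],
}
stream_logits = {"frames": 0.0, "subtitles": 0.0, "objects": 0.0}
candidates = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

best, probs = predict(question, streams, stream_logits, candidates)
print("predicted candidate index:", best)
```

In a trained model the embeddings and the stream weights would be learned from data rather than fixed by hand. Taking the candidate answers as an explicit input mirrors the multiple-choice setup that the abstract attributes to TVQA.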

