CSA-BERT: Video Question Answering
Kommineni Jenni, M. Srinivas, Roshni Sannapu, Murukessan Perumal
2023 IEEE Statistical Signal Processing Workshop (SSP), 2 July 2023. DOI: 10.1109/SSP53291.2023.10207954
{"title":"视频问答","authors":"Kommineni Jenni, M. Srinivas, Roshni Sannapu, Murukessan Perumal","doi":"10.1109/SSP53291.2023.10207954","DOIUrl":null,"url":null,"abstract":"Convolutional networks are a key component of many computer vision applications. However, convolutions have a serious flaw. It only works in a small area, hence it lacks global information. The Attention method, on the other hand, is a new improvement in capturing long range interactions that has mostly been used to sequence modeling and generative modeling tasks. As an alternative to convolutions, we investigate the use of convolutions with an attention mechanism in a video question answering task. We present a unique self-attention mechanism based on convolutions that outperforms convolutions in the video question answering task. We discovered that combining convolutions with self-attention produces the greatest outcomes in experiments. As a result, we propose a hybrid idea, which combines convolutional operators with the self-attention mechanism. We combine convolutional feature maps with self-attention feature maps. Experiments show that convolution with self-attention improves video question answering tasks on the MSRVTT-QA dataset.","PeriodicalId":296346,"journal":{"name":"2023 IEEE Statistical Signal Processing Workshop (SSP)","volume":"123 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"CSA-BERT: Video Question Answering\",\"authors\":\"Kommineni Jenni, M. Srinivas, Roshni Sannapu, Murukessan Perumal\",\"doi\":\"10.1109/SSP53291.2023.10207954\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Convolutional networks are a key component of many computer vision applications. However, convolutions have a serious flaw. It only works in a small area, hence it lacks global information. The Attention method, on the other hand, is a new improvement in capturing long range interactions that has mostly been used to sequence modeling and generative modeling tasks. As an alternative to convolutions, we investigate the use of convolutions with an attention mechanism in a video question answering task. We present a unique self-attention mechanism based on convolutions that outperforms convolutions in the video question answering task. We discovered that combining convolutions with self-attention produces the greatest outcomes in experiments. As a result, we propose a hybrid idea, which combines convolutional operators with the self-attention mechanism. We combine convolutional feature maps with self-attention feature maps. 
Experiments show that convolution with self-attention improves video question answering tasks on the MSRVTT-QA dataset.\",\"PeriodicalId\":296346,\"journal\":{\"name\":\"2023 IEEE Statistical Signal Processing Workshop (SSP)\",\"volume\":\"123 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-07-02\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2023 IEEE Statistical Signal Processing Workshop (SSP)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/SSP53291.2023.10207954\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 IEEE Statistical Signal Processing Workshop (SSP)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SSP53291.2023.10207954","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Convolutional networks are a key component of many computer vision applications, but convolutions have a serious limitation: each convolution operates only over a small local neighborhood, so it cannot capture global information. Self-attention, by contrast, is a more recent mechanism for capturing long-range interactions that has mostly been applied to sequence-modeling and generative-modeling tasks. We investigate augmenting convolutions with an attention mechanism in a video question answering task, and present a self-attention mechanism built on convolutions that outperforms convolutions alone on this task. In our experiments, combining convolutions with self-attention produces the best results, so we propose a hybrid design that couples convolutional operators with the self-attention mechanism by combining convolutional feature maps with self-attention feature maps. Experiments show that convolution with self-attention improves video question answering performance on the MSRVTT-QA dataset.
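To make the feature-map combination concrete, here is a minimal PyTorch sketch of one way such a hybrid block could look: a convolutional branch supplies local feature maps, a multi-head self-attention branch supplies global ones, and the two are concatenated along the channel axis. The class name, branch widths, and the concatenation strategy are all illustrative assumptions; the abstract does not specify the paper's exact architecture.

```python
# Hypothetical sketch of a conv + self-attention hybrid block.
# All names and dimensions are assumptions, not the paper's design.
import torch
import torch.nn as nn

class ConvSelfAttentionBlock(nn.Module):
    def __init__(self, in_ch: int, conv_ch: int, attn_ch: int, num_heads: int = 4):
        super().__init__()
        # Local branch: an ordinary 3x3 convolution over the feature map.
        self.conv = nn.Conv2d(in_ch, conv_ch, kernel_size=3, padding=1)
        # Global branch: project to attn_ch channels, then let every spatial
        # position attend to every other position.
        self.to_attn = nn.Conv2d(in_ch, attn_ch, kernel_size=1)
        self.attn = nn.MultiheadAttention(attn_ch, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, _, h, w = x.shape
        conv_feats = self.conv(x)                       # (B, conv_ch, H, W)
        a = self.to_attn(x).flatten(2).transpose(1, 2)  # (B, H*W, attn_ch)
        attn_feats, _ = self.attn(a, a, a)              # long-range interactions
        attn_feats = attn_feats.transpose(1, 2).reshape(b, -1, h, w)
        # Combine local and global feature maps along the channel axis.
        return torch.cat([conv_feats, attn_feats], dim=1)

# Usage: a batch of 8 frame-level feature maps, e.g. from a video backbone.
x = torch.randn(8, 64, 14, 14)
block = ConvSelfAttentionBlock(in_ch=64, conv_ch=96, attn_ch=32)
print(block(x).shape)  # torch.Size([8, 128, 14, 14])
```

Concatenation keeps the two feature types separate for downstream layers to weigh; summation or a learned gate would be equally plausible ways to fuse the branches under the same assumptions.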