Video to Text Study using an Encoder-Decoder Networks Approach
Carlos Ismael Orozco, M. Buemi, J. Jacobo-Berlles
2018 37th International Conference of the Chilean Computer Science Society (SCCC), November 2018
DOI: 10.1109/SCCC.2018.8705254
Citations: 0
Abstract
The automatic generation of video descriptions is currently a topic of interest in computer vision due to applications such as web indexing and video description for people with visual disabilities, among others. In this work we present an encoder-decoder neural network architecture. First, a 3D Convolutional Neural Network extracts the features of the input video. Then, a Long Short-Term Memory network decodes the resulting feature vector to automatically generate the description of the video. For training and testing we use the Microsoft Video Description Corpus (MSVD) data set, and we evaluate the performance of our system with the evaluation protocol of the COCO Image Captioning Challenge. We obtain scores of 0.3984, 0.2941 and 0.5052 for the BLEU, METEOR and CIDEr metrics respectively, results that are competitive with those reported in the literature.
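The pipeline the abstract describes (3D-CNN encoder producing a fixed-length feature vector, LSTM decoder unrolled word by word) can be illustrated with a framework-free sketch. This is not the authors' implementation: the encoder and the decoder step below are toy stand-ins (mean-pooling and a deterministic vocabulary lookup) meant only to show the data flow and the greedy decoding loop; all function names and the tiny vocabulary are illustrative assumptions.

```python
def encode_video(frames):
    """Stand-in for the 3D-CNN encoder: collapse a clip of grayscale
    frames (each a 2D list of pixel values) into a fixed-length feature
    vector by mean-pooling each frame. The real encoder is a learned
    3D convolutional network."""
    return [sum(sum(row) for row in frame) / (len(frame) * len(frame[0]))
            for frame in frames]

def lstm_step(feature, prev_word, vocab):
    """Stand-in for one LSTM decoder step: deterministically picks the
    next word from a toy combination of the feature vector and the
    previous word. A real decoder would apply learned LSTM gates and a
    softmax over the vocabulary."""
    idx = (int(sum(feature)) + vocab.index(prev_word)) % len(vocab)
    return vocab[idx]

def decode_caption(feature, vocab, max_len=8):
    """Greedy decoding loop: start from <bos>, emit one word per time
    step until <eos> or the length limit, mirroring how the LSTM
    decoder is unrolled over time."""
    words, prev = [], "<bos>"
    for _ in range(max_len):
        word = lstm_step(feature, prev, vocab)
        if word == "<eos>":
            break
        words.append(word)
        prev = word
    return " ".join(words)

# Toy "video": two 2x2 grayscale frames, plus a toy vocabulary.
clip = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
vocab = ["<bos>", "<eos>", "a", "dog", "runs"]
caption = decode_caption(encode_video(clip), vocab)
```

The output of the toy decoder is meaningless text; the point is the structure — a single feature vector conditions every step of an autoregressive word loop, which is the essence of the encoder-decoder captioning approach.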