An Augmented Embedding Spaces approach for Text-based Image Captioning
Doanh C. Bui, Truc Trinh, Nguyen D. Vo, Khang Nguyen
2021 8th NAFOSTED Conference on Information and Computer Science (NICS), 2021-12-21
DOI: https://doi.org/10.1109/NICS54270.2021.9701576
Citations: 0
Abstract
Scene-text-based image captioning is the task of generating a caption for an input image using both the visual context and the scene-text information it contains. To improve performance on this task, we propose two modules, Objects-augmented and Grid-features augmentation, which enhance spatial-location information and global-context understanding on top of the M4C-Captioner architecture for text-based image captioning. Experimental results on the TextCaps dataset show that our method outperforms the M4C-Captioner baseline. Our best results on the standard test set are 20.02% BLEU-4 and 85.64% CIDEr.
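For readers unfamiliar with the reported metrics, the sketch below illustrates how a sentence-level BLEU-4 score is computed: modified n-gram precisions for n = 1..4, combined by a geometric mean and scaled by a brevity penalty. This is a simplified, self-contained illustration (single reference, add-one smoothing), not the official TextCaps evaluation code; the example sentences are made up.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu4(candidate, reference):
    """Sentence-level BLEU-4 against a single reference, with add-one smoothing."""
    precisions = []
    for n in range(1, 5):
        cand_counts = Counter(ngrams(candidate, n))
        ref_counts = Counter(ngrams(reference, n))
        # clipped (modified) n-gram matches
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        precisions.append((overlap + 1) / (total + 1))  # add-one smoothing
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / 4)
    # brevity penalty punishes candidates shorter than the reference
    bp = 1.0 if len(candidate) >= len(reference) else \
        math.exp(1 - len(reference) / max(len(candidate), 1))
    return bp * geo_mean

# hypothetical caption pair for illustration
reference = "a man holds a sign that reads stop here".split()
candidate = "a man holds a sign reading stop here".split()
print(f"BLEU-4: {bleu4(candidate, reference):.4f}")
```

CIDEr, the second metric reported above, instead compares TF-IDF-weighted n-gram vectors between the candidate and a set of reference captions, so it rewards n-grams that are informative across the whole corpus rather than merely frequent.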