Mirza Muhammad Ali Baig, Mian Ihtisham Shah, Muhammad Abdullah Wajahat, Nauman Zafar, Omar Arif
{"title":"Image Caption Generator with Novel Object Injection","authors":"Mirza Muhammad Ali Baig, Mian Ihtisham Shah, Muhammad Abdullah Wajahat, Nauman Zafar, Omar Arif","doi":"10.1109/DICTA.2018.8615810","DOIUrl":null,"url":null,"abstract":"Image captioning is a field within artificial intelligence that is progressing rapidly and it has a lot of potentials. A major problem when working in this field is the limited amount of data that is available to us as is. The only dataset considered suitable enough for the task is the Microsoft: Common Objects in Context (MSCOCO) dataset, which contains about 120,000 training images. This covers about 80 object classes, which is an insufficient amount if we want to create robust solutions that aren't limited to the constraints of the data at hand. In order to overcome this problem, we propose a solution that incorporates Zero-Shot Learning concepts in order to identify unknown objects and classes by using semantic word embeddings and existing state-of-the-art object identification algorithms. Our proposed model, Image Captioning using Novel Word Injection, uses a pre-trained caption generator and works on the output of the generator to inject objects that are not present in the dataset into the caption. We evaluate the model on standardized metrics, namely, BLEU, CIDEr and ROUGE-L. The results, qualitatively and quantitatively, outperform the underlying model.","PeriodicalId":130057,"journal":{"name":"2018 Digital Image Computing: Techniques and Applications (DICTA)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2018 Digital Image Computing: Techniques and Applications (DICTA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/DICTA.2018.8615810","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 4
Abstract
Image captioning is a field within artificial intelligence that is progressing rapidly and it has a lot of potentials. A major problem when working in this field is the limited amount of data that is available to us as is. The only dataset considered suitable enough for the task is the Microsoft: Common Objects in Context (MSCOCO) dataset, which contains about 120,000 training images. This covers about 80 object classes, which is an insufficient amount if we want to create robust solutions that aren't limited to the constraints of the data at hand. In order to overcome this problem, we propose a solution that incorporates Zero-Shot Learning concepts in order to identify unknown objects and classes by using semantic word embeddings and existing state-of-the-art object identification algorithms. Our proposed model, Image Captioning using Novel Word Injection, uses a pre-trained caption generator and works on the output of the generator to inject objects that are not present in the dataset into the caption. We evaluate the model on standardized metrics, namely, BLEU, CIDEr and ROUGE-L. The results, qualitatively and quantitatively, outperform the underlying model.
图像字幕是人工智能中发展迅速的一个领域,它具有很大的潜力。在这个领域工作时的一个主要问题是我们现有的可用数据量有限。唯一被认为足够适合这项任务的数据集是微软:上下文中的公共对象(MSCOCO)数据集,它包含大约120,000张训练图像。这涵盖了大约80个对象类,如果我们想要创建不受手头数据约束的健壮解决方案,那么这个数量是不够的。为了克服这个问题,我们提出了一个结合Zero-Shot学习概念的解决方案,通过使用语义词嵌入和现有的最先进的对象识别算法来识别未知对象和类。我们提出的模型,使用Novel Word Injection的图像字幕,使用预训练的标题生成器,并在生成器的输出上工作,将数据集中不存在的对象注入到标题中。我们在标准化指标上评估模型,即BLEU, CIDEr和ROUGE-L。结果,定性和定量,优于基础模型。