Deep-learning-based image captioning: analysis and prospects

Q3 Computer Science

中国图象图形学报 (Journal of Image and Graphics) Pub Date: 2023-01-01 DOI: 10.11834/jig.220660

Zhao Yongqiang, Jin Zhi, Zhang Feng, Zhao Haiyan, Tao Zhengwei, Dou Chengfeng, Xu Xinhai, Liu Donghong

Citations: 0

Abstract

Image captioning uses a computer to automatically generate, for a given image, a complete and fluent descriptive sentence that suits the corresponding scene, realizing a cross-modal conversion from image to text. With the wide application of deep learning, both the accuracy and the inference speed of image captioning algorithms have improved greatly. On the basis of an extensive literature survey, this paper divides research on deep-learning-based image captioning into two levels: building the basic capabilities of image captioning, and studying its effectiveness in applications. These two levels can be further divided into core technical challenges: delivering richer feature information, solving the exposure bias problem, generating diverse captions, making captions controllable, and improving inference speed. For these challenges, the paper analyzes methods for delivering richer feature information from the perspectives of attention mechanisms, pretrained models, and multimodal models; methods for solving exposure bias from the perspectives of reinforcement learning, nonautoregressive models, and curriculum learning and scheduled sampling; methods for generating diverse captions from the perspectives of graph convolutional neural networks, generative adversarial networks, and data augmentation; methods for controllability from the perspectives of content control and style control; and methods for improving inference speed from the perspectives of nonautoregressive models, grid-based visual features, and convolutional-neural-network-based decoders. In addition, the paper describes in detail the common datasets, evaluation metrics, and performance of existing algorithms in the field, and offers predictions and an outlook on open problems and future research trends.

The task of image captioning is to use a computer to automatically generate a complete, fluent caption that suits the corresponding scene for a given image, realizing the cross-modal conversion from image to text. Describing the visual content of an image accurately and quickly is a fundamental goal of artificial intelligence and has a wide range of applications in research and production. Image captioning can be applied to many aspects of social development, such as text captions for images and videos, visual question answering, visual storytelling, web image analysis, and keyword-based image search. Image captions can also assist people with visual impairments, making the computer another pair of eyes for them. The accuracy and inference speed of image captioning algorithms have improved greatly with the wide application of deep learning. On the basis of an extensive literature survey, we find that deep-learning-based image captioning algorithms still face key technical challenges: delivering rich feature information, solving the exposure bias problem, generating diverse captions, making captions controllable, and improving inference speed.

The main framework of image captioning models is the encoder-decoder architecture. First, an encoder converts the input image into a fixed-length feature vector. Then, a decoder converts this feature vector into a caption. Therefore, the richer the feature information the model carries, the higher its accuracy and the better the generated captions. According to the research ideas behind existing algorithms, this study reviews image captioning algorithms that deliver rich feature information from three aspects: attention mechanisms, pretrained models, and multimodal models.

In many image captioning algorithms the training and prediction processes are not consistent with each other, which gives rise to exposure bias. When a model suffers from exposure bias, errors accumulate during word generation, so subsequent words become biased, which seriously affects the accuracy of the model. According to the problem-solving approach, this study reviews research on solving exposure bias in image captioning from three perspectives: reinforcement learning, nonautoregressive models, and curriculum learning and scheduled sampling.

Image captioning is inherently ambiguous because an image may admit multiple suitable captions. Existing methods tend to use common high-frequency expressions to generate relatively safe sentences. The resulting captions are simple, empty, and lacking in critical detail, which easily leads to a lack of diversity. According to the different research ideas, this study reviews existing methods for generating diverse captions from three aspects: graph convolutional neural networks, generative adversarial networks, and data augmentation.

Most current image captioning models lack controllability, which distinguishes them from human intelligence. Researchers have proposed algorithms that address this problem by actively controlling caption generation. These fall mainly into two categories: content-controlled captioning and style-controlled captioning. Content-controlled captioning aims to control which image content is described, such as particular regions or objects, so that the model describes the content the user is interested in. Style-controlled captioning aims to generate captions in different styles, such as humorous, romantic, or antique. This study reviews the algorithms for both categories.

Existing image captioning models are mostly encoder-decoder architectures in which the encoder extracts visual features with a convolutional neural network and the decoder is based on a recurrent neural network. According to the different research ideas, methods for improving the inference speed of image captioning models are divided into three categories. The first uses nonautoregressive models. The second uses grid-based visual features. The third uses a convolutional-neural-network-based decoder.

In addition, this study provides a detailed introduction to the general datasets and evaluation metrics in image captioning. The evaluation metrics mainly include the following: bilingual evaluation understudy (BLEU); recall-oriented understudy for gisting evaluation (ROUGE); metric for evaluation of translation with explicit ordering (METEOR); consensus-based image description evaluation (CIDEr); semantic propositional image caption evaluation (SPICE); compact bilinear pooling; text-to-image grounding for image caption evaluation; relevance, extraness, omission; and fidelity and adequacy ensured. The general datasets mainly include Flickr8K, Flickr30K, MS COCO (Microsoft Common Objects in Context), TextCaps, Localized Narratives, and Nocaps.

Finally, this study discusses in depth the open problems and future research directions in image captioning: how to improve visual feature extraction, how to improve caption diversity, how to improve the interpretability of deep learning models, how to transfer between multiple languages, how to automatically generate or design the optimal network architecture, and how to develop datasets and evaluation metrics suited to image captioning. Image captioning is an active research topic in computer vision and natural language processing. Many algorithms addressing different problems are proposed every year, and further research directions will emerge in the future.
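
To make the exposure-bias discussion in the abstract concrete, the following is a minimal Python sketch of scheduled sampling, one of the remedies the survey lists. It is an illustration under stated assumptions, not the method of any surveyed paper: `predict_next` is a hypothetical stand-in for a trained decoder, the inverse-sigmoid decay schedule and its constant `k` are assumed choices, and the toy tokens are invented for the example.

```python
import math
import random
from typing import Callable, List


def teacher_forcing_prob(step: int, k: float = 100.0) -> float:
    """Assumed inverse-sigmoid schedule: starts near 1 and decays toward 0.

    The exponent is capped only to avoid floating-point overflow.
    """
    return k / (k + math.exp(min(step / k, 700.0)))


def scheduled_sampling_inputs(
    gold: List[str],
    predict_next: Callable[[List[str]], str],
    eps: float,
    rng: random.Random,
    bos: str = "<bos>",
) -> List[str]:
    """Choose the token fed to the decoder at each training step.

    With probability eps the previous ground-truth word is fed (teacher
    forcing); otherwise the model's own previous prediction is fed, so the
    training inputs look more like what the model sees at inference time.
    """
    inputs = [bos]
    for t in range(1, len(gold)):
        if rng.random() < eps:
            inputs.append(gold[t - 1])           # ground-truth previous word
        else:
            inputs.append(predict_next(inputs))  # model's own previous guess
    return inputs


if __name__ == "__main__":
    gold = ["a", "dog", "runs", "<eos>"]
    # Hypothetical decoder that always guesses "cat", for illustration only.
    toy_model = lambda prefix: "cat"
    rng = random.Random(0)
    for step in (0, 200, 800):
        eps = teacher_forcing_prob(step)
        print(step, round(eps, 3),
              scheduled_sampling_inputs(gold, toy_model, eps, rng))
```

Early in training `eps` is close to 1, so the decoder is fed almost only ground-truth words. As `eps` decays, the decoder is increasingly conditioned on its own outputs, which is the mismatch between training and prediction that the abstract identifies as the source of accumulated errors.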

Source journal

中国图象图形学报 (Journal of Image and Graphics) Computer Science-Computer Graphics and Computer-Aided Design

CiteScore

1.20

Self-citation rate

0.00%

Number of published articles

6776

About the journal: Journal of Image and Graphics (ISSN 1006-8961, CN 11-3758/TB, CODEN ZTTXFZ) is an authoritative academic journal supervised by the Chinese Academy of Sciences and co-sponsored by the Institute of Space and Astronautical Information Innovation of the Chinese Academy of Sciences (ISIAS), the Chinese Society of Image and Graphics (CSIG), and the Beijing Institute of Applied Physics and Computational Mathematics (BIAPM). The journal brings together theory, technical methods, and the industrialisation of applied research results in image and graphics, and mainly publishes innovative, high-level research papers on basic and applied research in image and graphics science and closely related fields. Paper types include reviews, technical reports, project progress, academic news, new technology reviews, new product introductions, and industrialisation research. The content covers a wide range of fields such as image analysis and recognition, image understanding and computer vision, computer graphics, virtual reality and augmented reality, system simulation, and animation, and themed columns are opened according to research hotspots and cutting-edge topics. Journal of Image and Graphics reaches a wide range of readers, including scientific and technical personnel, enterprise supervisors, and postgraduate and undergraduate students engaged in the fields of national defence, military, aviation, aerospace, communications, electronics, automotive, agriculture, meteorology, environmental protection, remote sensing, mapping, oil fields, construction, transportation, finance, telecommunications, education, medical care, film and television, and art.
Journal of Image and Graphics is included in many important domestic and international scientific literature database systems, including the EBSCO database in the United States, the JST database in Japan, the Scopus database in the Netherlands, China Science and Technology Thesis Statistics and Analysis (Annual Research Report), China Science Citation Database (CSCD), China Academic Journal Network Publishing Database (CAJD), China Academic Journal Abstracts, Chinese Science Abstracts (Series A), China Electronic Science Abstracts, Chinese Core Journals Abstracts, Chinese Academic Journals on CD-ROM, and China Academic Journals Comprehensive Evaluation Database.