Semantic Attribute Enriched Storytelling from a Sequence of Images

2021 Digital Image Computing: Techniques and Applications (DICTA). Pub Date: 2021-11-01. DOI: 10.1109/DICTA52665.2021.9647213

Zainy M. Malakan, G. Hassan, M. Jalwana, Nayyer Aafaq, A. Mian

Citations: 2

Abstract

Visual storytelling (VST) is the task of generating story-like sentences from an ordered sequence of images. Contemporary techniques suffer from several limitations, such as inadequate encapsulation of visual variance and poor capture of context across the input sequence. Consequently, stories generated by such techniques often lack coherence, context, and semantic information. In this research, we devise a ‘Semantic Attribute Enriched Storytelling’ (SAES) framework to mitigate these issues. To that end, we first extract the visual features of the input image sequence and the noun entities present in the visual input by employing an off-the-shelf object detector. The two sets of features are concatenated to encapsulate the visual variance of the input sequence. The features are then passed through a bidirectional LSTM sequence encoder to capture the past and future context of the input image sequence, followed by an attention mechanism that enhances the discriminability of the input to the language model, i.e., a mogrifier-LSTM. Additionally, we incorporate semantic attributes, e.g., nouns, to complement the semantic context in the generated story. Detailed experimental and human evaluations are performed to establish the competitive performance of the proposed technique. We achieve up to a 1.4% improvement on the BLEU metric over recent state-of-the-art methods.
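
To make the data flow in the abstract more concrete, the following is a minimal pure-Python sketch of three of the steps it names: concatenating per-image visual and noun-entity features, attention-weighted pooling over the image sequence, and the mutual gating that gives the mogrifier-LSTM its name. It is an illustration under stated assumptions, not the authors' implementation. The function names (`concat_features`, `attend`, `mogrify`), the dot-product form of the attention, and all dimensions are hypothetical. The gating rule follows the commonly cited mogrifier formulation, in which the input and the previous hidden state alternately rescale each other before the LSTM update. The bidirectional LSTM encoder and the LSTM update itself are omitted.

```python
import math


def concat_features(visual, nouns):
    """Join each image's visual feature vector with its noun-entity vector."""
    return [v + n for v, n in zip(visual, nouns)]


def attend(states, query):
    """Dot-product attention (an assumed form): softmax weights over the sequence."""
    scores = [sum(s * q for s, q in zip(state, query)) for state in states]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(states[0])
    context = [sum(w * st[d] for w, st in zip(weights, states)) for d in range(dim)]
    return context, weights


def _sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def _matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def mogrify(x, h, q_mats, r_mats):
    """Alternately gate x by h (via Q) and h by x (via R).

    q_mats[i] has shape len(x) x len(h); r_mats[i] has shape len(h) x len(x).
    Expects len(q_mats) == len(r_mats) or len(q_mats) == len(r_mats) + 1.
    """
    for i in range(len(q_mats) + len(r_mats)):
        if i % 2 == 0:
            gate = _matvec(q_mats[i // 2], h)
            x = [2.0 * _sigmoid(g) * xi for g, xi in zip(gate, x)]
        else:
            gate = _matvec(r_mats[i // 2], x)
            h = [2.0 * _sigmoid(g) * hi for g, hi in zip(gate, h)]
    return x, h


# Toy usage: two images, 2-d visual features, 3-d noun indicators (all made up).
visual = [[0.2, 0.7], [0.9, 0.1]]
nouns = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
features = concat_features(visual, nouns)       # two 5-d vectors
context, weights = attend(features, [1.0] * 5)  # pooled 5-d context vector
```

With all-zero gating matrices, every gate equals 2 * sigmoid(0) = 1, so `mogrify` leaves both vectors unchanged. This is why the formulation includes the factor of 2: the gating starts out near the identity and learns to deviate from it.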
