Understanding, Categorizing and Predicting Semantic Image-Text Relations
Christian Otto, Matthias Springstein, Avishek Anand, R. Ewerth
Proceedings of the 2019 International Conference on Multimedia Retrieval (ICMR 2019)
DOI: 10.1145/3323873.3325049
Published: 2019-06-05
Citations: 26
Abstract
Two modalities are often used to convey information in a complementary and beneficial manner, e.g., in online news, videos, educational resources, or scientific publications. The automatic understanding of semantic correlations between text and associated images, as well as their interplay, has great potential to enhance multimodal web search and recommender systems. However, the automatic understanding of multimodal information is still an unsolved research problem. Recent approaches such as image captioning focus on precisely describing visual content and translating it to text, but typically address neither semantic interpretations nor the specific role or purpose of an image-text constellation. In this paper, we go beyond previous work and, inspired by research in visual communication, investigate semantic image-text relations that are useful for multimodal information retrieval. We derive a categorization of eight semantic image-text classes (e.g., "illustration" or "anchorage") and show how they can be systematically characterized by a set of three metrics: cross-modal mutual information, semantic correlation, and the status relation of image and text. Furthermore, we present a deep learning system that predicts these classes by utilizing multimodal embeddings. To obtain a sufficiently large amount of training data, we have automatically collected and augmented data from a variety of datasets and web resources, which enables future research on this topic. Experimental results on a demanding test set demonstrate the feasibility of the approach.
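
To make the prediction component described in the abstract more concrete, below is a minimal sketch of a classifier that fuses precomputed image and text embeddings and outputs one of the eight semantic image-text classes. This is not the authors' exact architecture: the embedding dimensions, the concatenation-based fusion, and the MLP head are illustrative assumptions only.

```python
import torch
import torch.nn as nn


class ImageTextRelationClassifier(nn.Module):
    """Hypothetical fusion classifier over precomputed image and text embeddings.

    Illustrative sketch only; dimensions and layers are assumptions, not the
    architecture used in the paper.
    """

    def __init__(self, image_dim=2048, text_dim=768, hidden_dim=512, num_classes=8):
        super().__init__()
        # Simple late fusion: concatenate the two embeddings, then an MLP head
        # produces logits over the eight semantic image-text classes
        # (e.g., "illustration", "anchorage").
        self.head = nn.Sequential(
            nn.Linear(image_dim + text_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(p=0.3),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, image_emb, text_emb):
        fused = torch.cat([image_emb, text_emb], dim=-1)
        return self.head(fused)


if __name__ == "__main__":
    # Random stand-ins for embeddings that would normally come from pretrained
    # encoders (e.g., a CNN image encoder and a sentence encoder for the text).
    model = ImageTextRelationClassifier()
    image_emb = torch.randn(4, 2048)
    text_emb = torch.randn(4, 768)
    logits = model(image_emb, text_emb)      # shape: (4, 8)
    predicted = logits.argmax(dim=-1)        # index of the predicted relation class
    print(predicted)
```

In practice, the multimodal embeddings would be produced by pretrained image and text encoders, and concatenation is only one of several reasonable fusion strategies; the paper itself may combine the modalities differently.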