基于用户生成标签的图像标注多标签三重嵌入

Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval Pub Date : 2018-06-05 DOI:10.1145/3206025.3206061

Zachary Seymour, Zhongfei Zhang

{"title":"基于用户生成标签的图像标注多标签三重嵌入","authors":"Zachary Seymour, Zhongfei Zhang","doi":"10.1145/3206025.3206061","DOIUrl":null,"url":null,"abstract":"This work studies the representational embedding of images and their corresponding annotations--in the form of tag metadata--such that, given a piece of the raw data in one modality, the corresponding semantic description can be retrieved in terms of the raw data in another. While convolutional neural networks (CNNs) have been widely and successfully applied in this domain with regards to detecting semantically simple scenes or categories (even though many such objects may be simultaneously present in an image), this work approaches the task of dealing with image annotations in the context of noisy, user-generated, and semantically complex multi-labels, widely available from social media sites. In this case, the labels for an image are diverse, noisy, and often not specifically related to an object, but rather descriptive or user-specific. Furthermore, the existing deep image annotation literature using this type of data typically utilizes the so-called CNN-RNN framework, combining convolutional and recurrent neural networks. We offer a discussion of why RNNs may not be the best choice in this case, though they have been shown to perform well on the similar captioning tasks. Our model exploits the latent image-text space through the use of a triplet loss framework to learn a joint embedding space for the images and their tags, in the presence of multiple, potentially positive exemplar classes. We present state-of-the-art results of the representational properties of these embeddings on several image annotation datasets to show the promise of this approach.","PeriodicalId":224132,"journal":{"name":"Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":"{\"title\":\"Multi-label Triplet Embeddings for Image Annotation from User-Generated Tags\",\"authors\":\"Zachary Seymour, Zhongfei Zhang\",\"doi\":\"10.1145/3206025.3206061\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"This work studies the representational embedding of images and their corresponding annotations--in the form of tag metadata--such that, given a piece of the raw data in one modality, the corresponding semantic description can be retrieved in terms of the raw data in another. While convolutional neural networks (CNNs) have been widely and successfully applied in this domain with regards to detecting semantically simple scenes or categories (even though many such objects may be simultaneously present in an image), this work approaches the task of dealing with image annotations in the context of noisy, user-generated, and semantically complex multi-labels, widely available from social media sites. In this case, the labels for an image are diverse, noisy, and often not specifically related to an object, but rather descriptive or user-specific. Furthermore, the existing deep image annotation literature using this type of data typically utilizes the so-called CNN-RNN framework, combining convolutional and recurrent neural networks. We offer a discussion of why RNNs may not be the best choice in this case, though they have been shown to perform well on the similar captioning tasks. Our model exploits the latent image-text space through the use of a triplet loss framework to learn a joint embedding space for the images and their tags, in the presence of multiple, potentially positive exemplar classes. We present state-of-the-art results of the representational properties of these embeddings on several image annotation datasets to show the promise of this approach.\",\"PeriodicalId\":224132,\"journal\":{\"name\":\"Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval\",\"volume\":\"1 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2018-06-05\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"2\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3206025.3206061\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3206025.3206061","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 2

摘要

这项工作研究了图像的表征嵌入及其相应的注释——以标签元数据的形式——这样，给定一种模态的原始数据，可以根据另一种模态的原始数据检索相应的语义描述。虽然卷积神经网络(cnn)在检测语义简单的场景或类别(即使许多这样的对象可能同时出现在图像中)方面已经在该领域得到了广泛而成功的应用，但这项工作接近于在嘈杂、用户生成和语义复杂的多标签背景下处理图像注释的任务，这些多标签广泛地从社交媒体网站上获得。在这种情况下，图像的标签是多种多样的、嘈杂的，并且通常与对象无关，而是描述性的或特定于用户的。此外，现有的使用这类数据的深度图像注释文献通常使用所谓的CNN-RNN框架，结合卷积和循环神经网络。我们讨论了为什么rnn在这种情况下可能不是最佳选择，尽管它们已经被证明在类似的字幕任务上表现良好。我们的模型利用潜在的图像-文本空间，通过使用三重损失框架来学习图像及其标签的联合嵌入空间，在存在多个潜在的正样本类的情况下。我们在几个图像标注数据集上展示了这些嵌入的表征特性的最新结果，以展示这种方法的前景。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Multi-label Triplet Embeddings for Image Annotation from User-Generated Tags

This work studies the representational embedding of images and their corresponding annotations--in the form of tag metadata--such that, given a piece of the raw data in one modality, the corresponding semantic description can be retrieved in terms of the raw data in another. While convolutional neural networks (CNNs) have been widely and successfully applied in this domain with regards to detecting semantically simple scenes or categories (even though many such objects may be simultaneously present in an image), this work approaches the task of dealing with image annotations in the context of noisy, user-generated, and semantically complex multi-labels, widely available from social media sites. In this case, the labels for an image are diverse, noisy, and often not specifically related to an object, but rather descriptive or user-specific. Furthermore, the existing deep image annotation literature using this type of data typically utilizes the so-called CNN-RNN framework, combining convolutional and recurrent neural networks. We offer a discussion of why RNNs may not be the best choice in this case, though they have been shown to perform well on the similar captioning tasks. Our model exploits the latent image-text space through the use of a triplet loss framework to learn a joint embedding space for the images and their tags, in the presence of multiple, potentially positive exemplar classes. We present state-of-the-art results of the representational properties of these embeddings on several image annotation datasets to show the promise of this approach.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval

自引率

0.00%

发文量