Multimodal Network Embedding via Attention based Multi-view Variational Autoencoder

Feiran Huang, Xiaoming Zhang, Chaozhuo Li, Zhoujun Li, Yueying He, Zhonghua Zhao
{"title":"基于注意的多视图变分自编码器的多模态网络嵌入","authors":"Feiran Huang, Xiaoming Zhang, Chaozhuo Li, Zhoujun Li, Yueying He, Zhonghua Zhao","doi":"10.1145/3206025.3206035","DOIUrl":null,"url":null,"abstract":"Learning the embedding for social media data has attracted extensive research interests as well as boomed a lot of applications, such as classification and link prediction. In this paper, we examine the scenario of a multimodal network with nodes containing multimodal contents and connected by heterogeneous relationships, such as social images containing multimodal contents (e.g., visual content and text description), and linked with various forms (e.g., in the same album or with the same tag). However, given the multimodal network, simply learning the embedding from the network structure or a subset of content results in sub-optimal representation. In this paper, we propose a novel deep embedding method, i.e., Attention-based Multi-view Variational Auto-Encoder (AMVAE), to incorporate both the link information and the multimodal contents for more effective and efficient embedding. Specifically, we adopt LSTM with attention model to learn the correlation between different data modalities, such as the correlation between visual regions and the specific words, to obtain the semantic embedding of the multimodal contents. Then, the link information and the semantic embedding are considered as two correlated views. A multi-view correlation learning based Variational Auto-Encoder (VAE) is proposed to learn the representation of each node, in which the embedding of link information and multimodal contents are integrated and mutually reinforced. Experiments on three real-world datasets demonstrate the superiority of the proposed model in two applications, i.e., multi-label classification and link prediction.","PeriodicalId":224132,"journal":{"name":"Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval","volume":"191 2","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"28","resultStr":"{\"title\":\"Multimodal Network Embedding via Attention based Multi-view Variational Autoencoder\",\"authors\":\"Feiran Huang, Xiaoming Zhang, Chaozhuo Li, Zhoujun Li, Yueying He, Zhonghua Zhao\",\"doi\":\"10.1145/3206025.3206035\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Learning the embedding for social media data has attracted extensive research interests as well as boomed a lot of applications, such as classification and link prediction. In this paper, we examine the scenario of a multimodal network with nodes containing multimodal contents and connected by heterogeneous relationships, such as social images containing multimodal contents (e.g., visual content and text description), and linked with various forms (e.g., in the same album or with the same tag). However, given the multimodal network, simply learning the embedding from the network structure or a subset of content results in sub-optimal representation. In this paper, we propose a novel deep embedding method, i.e., Attention-based Multi-view Variational Auto-Encoder (AMVAE), to incorporate both the link information and the multimodal contents for more effective and efficient embedding. 
Specifically, we adopt LSTM with attention model to learn the correlation between different data modalities, such as the correlation between visual regions and the specific words, to obtain the semantic embedding of the multimodal contents. Then, the link information and the semantic embedding are considered as two correlated views. A multi-view correlation learning based Variational Auto-Encoder (VAE) is proposed to learn the representation of each node, in which the embedding of link information and multimodal contents are integrated and mutually reinforced. Experiments on three real-world datasets demonstrate the superiority of the proposed model in two applications, i.e., multi-label classification and link prediction.\",\"PeriodicalId\":224132,\"journal\":{\"name\":\"Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval\",\"volume\":\"191 2\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2018-06-05\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"28\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3206025.3206035\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3206025.3206035","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 28

Abstract

Learning embeddings for social media data has attracted extensive research interest and has enabled many applications, such as classification and link prediction. In this paper, we examine the scenario of a multimodal network whose nodes contain multimodal content and are connected by heterogeneous relationships; for example, social images contain multimodal content (e.g., visual content and a text description) and are linked in various ways (e.g., belonging to the same album or sharing the same tag). However, given such a multimodal network, learning the embedding from the network structure alone or from only a subset of the content yields a sub-optimal representation. In this paper, we propose a novel deep embedding method, the Attention-based Multi-view Variational Auto-Encoder (AMVAE), which incorporates both the link information and the multimodal content for more effective and efficient embedding. Specifically, we adopt an LSTM with an attention model to learn the correlations between data modalities, such as those between visual regions and specific words, and thereby obtain a semantic embedding of the multimodal content. The link information and the semantic embedding are then treated as two correlated views, and a multi-view correlation learning based Variational Auto-Encoder (VAE) is proposed to learn the representation of each node, in which the embeddings of the link information and the multimodal content are integrated and mutually reinforced. Experiments on three real-world datasets demonstrate the superiority of the proposed model in two applications: multi-label classification and link prediction.
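
To make the fusion step concrete, the following is a minimal PyTorch sketch of attention-based multimodal fusion in the spirit of the abstract: an LSTM encodes the word sequence, and each word state attends over projected visual-region features. All class names, layer sizes, and the fusion rule (adding the attended region context to the word states and mean-pooling) are illustrative assumptions, not the paper's exact AMVAE architecture.

```python
# Sketch of word-region attention for multimodal semantic embedding.
# Dimensions (300-d word vectors, 2048-d region features) are assumptions.
import torch
import torch.nn as nn


class AttentiveFusion(nn.Module):
    def __init__(self, word_dim=300, region_dim=2048, hidden_dim=512):
        super().__init__()
        self.lstm = nn.LSTM(word_dim, hidden_dim, batch_first=True)
        # Project visual regions into the same space as the LSTM states.
        self.region_proj = nn.Linear(region_dim, hidden_dim)

    def forward(self, word_embs, region_feats):
        # word_embs:    (batch, n_words, word_dim)
        # region_feats: (batch, n_regions, region_dim)
        h, _ = self.lstm(word_embs)                    # (batch, n_words, hidden_dim)
        regions = self.region_proj(region_feats)       # (batch, n_regions, hidden_dim)
        # Score every word state against every visual region.
        scores = torch.bmm(h, regions.transpose(1, 2))  # (batch, n_words, n_regions)
        attn = torch.softmax(scores, dim=-1)
        attended = torch.bmm(attn, regions)            # region context per word
        # Fuse text states with their attended visual context, then pool.
        return (h + attended).mean(dim=1)              # (batch, hidden_dim)
```

For instance, `AttentiveFusion()(torch.randn(4, 12, 300), torch.randn(4, 36, 2048))` would return a (4, 512) semantic embedding, one vector per image-text pair, in which each word has attended to the regions most correlated with it.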
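Likewise, here is a minimal sketch of treating the link information and the semantic embedding as two correlated views of a VAE, assuming both views are already fixed-length vectors; the latent mean serves as the node embedding at inference time. The concatenation-based encoder, the unweighted loss terms, and all dimensions are assumptions for illustration rather than the paper's multi-view correlation learning objective.

```python
# Two-view VAE sketch: both views are encoded, concatenated, and mapped to a
# shared Gaussian latent from which both views are reconstructed, so the two
# embeddings can reinforce each other. Layer sizes are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiViewVAE(nn.Module):
    def __init__(self, link_dim=128, content_dim=512, latent_dim=64):
        super().__init__()
        self.enc_link = nn.Linear(link_dim, 256)        # link-view encoder
        self.enc_content = nn.Linear(content_dim, 256)  # content-view encoder
        self.to_mu = nn.Linear(512, latent_dim)
        self.to_logvar = nn.Linear(512, latent_dim)
        self.dec_link = nn.Linear(latent_dim, link_dim)
        self.dec_content = nn.Linear(latent_dim, content_dim)

    def forward(self, link_view, content_view):
        h = torch.cat([F.relu(self.enc_link(link_view)),
                       F.relu(self.enc_content(content_view))], dim=-1)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        return self.dec_link(z), self.dec_content(z), mu, logvar


def vae_loss(model, link_view, content_view):
    """Reconstruct both views plus a KL term; mu is the node embedding."""
    link_rec, content_rec, mu, logvar = model(link_view, content_view)
    rec = F.mse_loss(link_rec, link_view) + F.mse_loss(content_rec, content_view)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kl
```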