Convolutional neural networks and multimodal fusion for text aided image classification
Authors: Dongzhe Wang, K. Mao, G. Ng
Published in: 2017 20th International Conference on Information Fusion (Fusion), 2017-07-10
DOI: https://doi.org/10.23919/ICIF.2017.8009768
With the exponential growth of web meta-data, exploiting multimodal online sources via standard search engines has become a trend in visual recognition, as it effectively alleviates the shortage of training data. However, web meta-data such as text is usually not as cooperative as expected because of its unstructured nature. To address this problem, this paper investigates the numerical representation of web text data. We first adopt a convolutional neural network (CNN) for web text modeling on top of word vectors. Combined with a CNN for images, we present a multimodal fusion scheme that maximizes the discriminative power of the visual and textual modalities at the decision level and the feature level simultaneously. Experimental results show that the proposed framework achieves significant improvement in large-scale image classification on the Pascal VOC-2007 and VOC-2012 datasets.
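The abstract names three ingredients: a CNN over word vectors, feature-level fusion, and decision-level fusion. The NumPy sketch below illustrates what each term generally means; it is not the authors' architecture. The function names, the single convolution layer with max-over-time pooling, the concatenation-based feature fusion, the weighted-average decision fusion, and the `alpha` parameter are all assumptions chosen for illustration, and the weights are random rather than trained.

```python
import numpy as np

def text_cnn_features(word_vectors, filters, bias):
    """One convolution layer over a word sequence, ReLU, then max-over-time pooling.

    word_vectors: (num_words, embed_dim), one row per word
    filters:      (num_filters, window, embed_dim)
    bias:         (num_filters,)
    returns:      (num_filters,) fixed-length text feature
    """
    num_words, _ = word_vectors.shape
    _, window, _ = filters.shape
    # Stack every window of consecutive words: (num_positions, window, embed_dim)
    windows = np.stack(
        [word_vectors[i:i + window] for i in range(num_words - window + 1)]
    )
    # Response of each filter at each position: (num_positions, num_filters)
    responses = np.einsum("pwd,fwd->pf", windows, filters) + bias
    # ReLU, then keep the strongest response of each filter over all positions
    return np.maximum(responses, 0.0).max(axis=0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def feature_level_fusion(image_feat, text_feat, weights, bias):
    """Assumed form: concatenate both feature vectors, then one linear classifier."""
    joint = np.concatenate([image_feat, text_feat])
    return softmax(weights @ joint + bias)

def decision_level_fusion(image_probs, text_probs, alpha=0.5):
    """Assumed form: weighted average of the per-modality class probabilities."""
    return alpha * image_probs + (1.0 - alpha) * text_probs

# Toy data (hypothetical sizes; the image feature stands in for an image CNN output)
rng = np.random.default_rng(0)
num_classes, embed_dim, num_filters, window = 20, 8, 6, 3
text_feat = text_cnn_features(
    rng.normal(size=(12, embed_dim)),                  # 12 words of web text
    rng.normal(size=(num_filters, window, embed_dim)),
    np.zeros(num_filters),
)
image_feat = rng.normal(size=16)

fused_probs = feature_level_fusion(
    image_feat, text_feat,
    rng.normal(size=(num_classes, image_feat.size + text_feat.size)),
    np.zeros(num_classes),
)
late_probs = decision_level_fusion(
    softmax(rng.normal(size=num_classes)),             # image-only classifier output
    softmax(rng.normal(size=num_classes)),             # text-only classifier output
    alpha=0.6,
)
```

Feature-level fusion lets a single classifier see both modalities at once, while decision-level fusion keeps the per-modality classifiers independent and only merges their outputs. How the paper combines the two levels "simultaneously" is not specified in the abstract, so the sketch keeps them separate.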