Convolutional neural networks and multimodal fusion for text aided image classification
Authors: Dongzhe Wang, K. Mao, G. Ng
Venue: 2017 20th International Conference on Information Fusion (Fusion)
Publication date: 2017-07-10
DOI: 10.23919/ICIF.2017.8009768
Citations: 19
Abstract
With the exponential growth of web meta-data, exploiting multimodal online sources via standard search engines has become a trend in visual recognition, as it effectively alleviates the shortage of training data. However, web meta-data such as text is often harder to exploit than expected because it is unstructured. To address this problem, this paper investigates the numerical representation of web text data. We first adopt a convolutional neural network (CNN) for web text modeling on top of word vectors. Combined with a CNN for images, we present a multimodal fusion scheme that maximizes the discriminative power of the visual and textual modalities at the decision level and the feature level simultaneously. Experimental results show that the proposed framework achieves significant improvement in large-scale image classification on the Pascal VOC-2007 and VOC-2012 datasets.
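The abstract does not spell out the network layers or the fusion rule, so the following is only a minimal sketch of the general ideas it names: a convolution over word vectors with max-over-time pooling, feature-level fusion, and decision-level fusion. Every function name, the ReLU activation, the concatenation rule, and the weighted-average rule with its `image_weight` parameter are assumptions made for illustration, not the authors' method.

```python
import math


def text_cnn_features(word_vectors, filters, biases):
    """Slide each filter over the sentence and keep its strongest response.

    word_vectors: list of n_words vectors, each of length dim.
    filters: list of filters, each a list of `window` weight vectors of length dim.
    biases: one bias per filter.
    Returns one pooled feature per filter (assumed ReLU + max-over-time pooling).
    """
    window = len(filters[0])
    n_positions = len(word_vectors) - window + 1
    if n_positions < 1:
        raise ValueError("sentence is shorter than the filter window")
    features = []
    for filt, bias in zip(filters, biases):
        best = 0.0  # ReLU floor: negative activations are clipped to 0
        for start in range(n_positions):
            activation = bias
            for offset in range(window):
                word = word_vectors[start + offset]
                activation += sum(w * x for w, x in zip(filt[offset], word))
            best = max(best, activation)
        features.append(best)
    return features


def softmax(scores):
    """Turn raw class scores into probabilities (shifted for stability)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def feature_level_fusion(image_features, text_features):
    """Assumed rule: concatenate both feature vectors for a joint classifier."""
    return list(image_features) + list(text_features)


def decision_level_fusion(image_probs, text_probs, image_weight=0.5):
    """Assumed rule: weighted average of the per-class probabilities."""
    return [
        image_weight * p_img + (1.0 - image_weight) * p_txt
        for p_img, p_txt in zip(image_probs, text_probs)
    ]


# Toy usage with made-up numbers: a 3-word sentence of 2-d word vectors.
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_filters = [[[1.0, 0.0], [0.0, 1.0]], [[-1.0, -1.0], [-1.0, -1.0]]]
text_features = text_cnn_features(sentence, toy_filters, [0.0, 0.0])
joint_features = feature_level_fusion([0.5], text_features)
fused_probs = decision_level_fusion(softmax([1.0, 2.0, 3.0]), softmax([3.0, 2.0, 1.0]))
```

In this sketch the two fusion levels are independent: `joint_features` would feed a single classifier trained on both modalities, while `fused_probs` combines the outputs of two separately trained classifiers. A system that uses both levels at once, as the abstract describes, would need some way to combine them, which is not shown here.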