{"title":"Visual Question Answering With a Hybrid Convolution Recurrent Model","authors":"Philipp Harzig, C. Eggert, R. Lienhart","doi":"10.1145/3206025.3206054","DOIUrl":"https://doi.org/10.1145/3206025.3206054","url":null,"abstract":"Visual Question Answering (VQA) is a relatively new task, which tries to infer answer sentences for an input image coupled with a corresponding question. Instead of dynamically generating answers, they are usually inferred by finding the most probable answer from a fixed set of possible answers. Previous work did not address the problem of finding all possible answers, but only modeled the answering part of VQA as a classification task. To tackle this problem, we infer answer sentences by using a Long Short-Term Memory (LSTM) network that allows us to dynamically generate answers for (image, question) pairs. In a series of experiments, we discover an end-to-end Deep Neural Network structure, which allows us to dynamically answer questions referring to a given input image by using an LSTM decoder network. With this approach, we are able to generate both less common answers, which are not considered by classification models, and more complex answers with the appearance of datasets containing answers that consist of more than three words.","PeriodicalId":224132,"journal":{"name":"Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval","volume":"93 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123285799","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Who to Ask: An Intelligent Fashion Consultant","authors":"Yangbangyan Jiang, Qianqian Xu, Xiaochun Cao, Qingming Huang","doi":"10.1145/3206025.3206092","DOIUrl":"https://doi.org/10.1145/3206025.3206092","url":null,"abstract":"Humankind has always been in pursuit of fashion. Nevertheless, people are often troubled by collocating clothes, e.g., tops, bottoms, shoes, and accessories, from numerous fashion items in their closets. Moreover, it may be expensive and inconvenient to employ a fashion stylist. In this paper, we present Stile, an end-to-end intelligent fashion consultant system, to generate stylish outfits for given items. Unlike previous systems, our framework considers the global compatibility of fashion items in the outfit and models the dependencies among items in a fixed order via a bidirectional LSTM. Therefore, it can guarantee that items in the same outfit should share a similar style and neither redundant nor missing items exist in the resulting outfit for essential categories. The demonstration shows that our proposed system provides people with a practical and convenient solution to find natural and proper fashion outfits.","PeriodicalId":224132,"journal":{"name":"Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123329139","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Recognizing Actions in Wearable-Camera Videos by Training Classifiers on Fixed-Camera Videos","authors":"Yang Mi, Kang Zheng, Song Wang","doi":"10.1145/3206025.3206041","DOIUrl":"https://doi.org/10.1145/3206025.3206041","url":null,"abstract":"Recognizing human actions in wearable camera videos, such as videos taken by GoPro or Google Glass, can benefit many multimedia applications. By mixing the complex and non-stop motion of the camera, motion features extracted from videos of the same action may show very large variation and inconsistency. It is very difficult to collect sufficient videos to cover all such variations and use them to train action classifiers with good generalization ability. In this paper, we develop a new approach to train action classifiers on a relatively smaller set of fixed-camera videos with different views, and then apply them to recognize actions in wearable-camera videos. In this approach, we temporally divide the input video into many shorter video segments and transform the motion features to stable ones in each video segment, in terms of a fixed view defined by an anchor frame in the segment. Finally, we use sparse coding to estimate the action likelihood in each segment, followed by combining the likelihoods from all the video segments for action recognition. We conduct experiments by training on a set of fixed-camera videos and testing on a set of wearable-camera videos, with very promising results.","PeriodicalId":224132,"journal":{"name":"Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125340326","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Image Annotation Retrieval with Text-Domain Label Denoising","authors":"Zachary Seymour, Zhongfei Zhang","doi":"10.1145/3206025.3206063","DOIUrl":"https://doi.org/10.1145/3206025.3206063","url":null,"abstract":"This work explores the problem of making user-generated text data, in the form of noisy tags, usable for tasks such as automatic image annotation and image retrieval by denoising the data. Earlier work in this area has focused on filtering out noisy, sparse, or incorrect tags by representing an image by the accumulation of the tags of its nearest neighbors in the visual space. However, this imposes an expensive preprocessing step that must be performed for each new set of images and tags and relies on assumptions about the way the images have been labelled that we find do not always hold. We instead propose a technique for calculating a set of probabilities for the relevance of each tag for a given image relying soley on information in the text domain, namely through widely-available pretrained continous word embeddings. By first clustering the word embeddings for the tags, we calculate a set of weights representing the probability that each tag is meaningful to the image content. Given the set of tags denoised in this way, we use kernel canonical correlation analysis (KCCA) to learn a semantic space which we can project into to retrieve relevant tags for unseen images or to retrieve images for unseen tags. This work also explores the deficiencies of the use of continuous word embeddings for automatic image annotation in the existing KCCA literature and introduces a new method for constructing textual kernel matrices using these word vectors that improves tag retrieval results for both user-generated tags as well as expert labels.","PeriodicalId":224132,"journal":{"name":"Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval","volume":"201 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115185771","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dense Dilated Network for Few Shot Action Recognition","authors":"Baohan Xu, Hao Ye, Yingbin Zheng, Heng Wang, Tianyu Luwang, Yu-Gang Jiang","doi":"10.1145/3206025.3206028","DOIUrl":"https://doi.org/10.1145/3206025.3206028","url":null,"abstract":"Recently, video action recognition has been widely studied. Training deep neural networks requires a large amount of well-labeled videos. On the other hand, videos in the same class share high-level semantic similarity. In this paper, we introduce a novel neural network architecture to simultaneously capture local and long-term spatial temporal information. The dilated dense network is proposed with the blocks being composed of densely-connected dilated convolutions layers. The proposed framework is capable of fusing each layer's outputs to learn high-level representations, and the representations are robust even with only few training snippets. The aggregations of dilated dense blocks are also explored. We conduct extensive experiments on UCF101 and demonstrate the effectiveness of our proposed method, especially with few training examples.","PeriodicalId":224132,"journal":{"name":"Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval","volume":"117 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115198630","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dynamic Construction and Manipulation of Hierarchical Quartic Image Graphs","authors":"N. Hezel, K. U. Barthel","doi":"10.1145/3206025.3206093","DOIUrl":"https://doi.org/10.1145/3206025.3206093","url":null,"abstract":"Over the last years, we have published papers about intuitive image graph navigation and showed how to build static hierarchical image graphs efficiently. In this paper, we showcase new results and present techniques to dynamically construct and manipulate these kinds of graphs. They connect similar images and perform well in retrieving tasks regardless of the number of nodes. By applying an improved fast self-sorting map algorithm, entire image collections (structured in a graph) can be explored with a user interface resembling common navigation services.","PeriodicalId":224132,"journal":{"name":"Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122165251","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval","authors":"","doi":"10.1145/3206025","DOIUrl":"https://doi.org/10.1145/3206025","url":null,"abstract":"","PeriodicalId":224132,"journal":{"name":"Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval","volume":"98 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124071578","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"VP-ReID: Vehicle and Person Re-Identification System","authors":"Longhui Wei, Xiaobin Liu, Jianing Li, Shiliang Zhang","doi":"10.1145/3206025.3206086","DOIUrl":"https://doi.org/10.1145/3206025.3206086","url":null,"abstract":"With the capability of locating and tracking specific suspects or vehicles in a large camera network, person Re-Identification (ReID) and vehicle ReID show potential to be a key technology in smart surveillance system. They have been drawing lots of attentions from both academia and industry. To demonstrate our recent research progresses on those two tasks, we develop a robust and efficient person and video ReID system named as VP-ReID. This system is build based on our recent works including Deep Convolutional Neural Network design for discriminative feature extraction, efficient off-line indexing, as well as distance metric optimization for deep feature learning. Constructed upon those algorithms, VP-ReID identifies query vehicle and person efficiently and accurately from a large gallery set.","PeriodicalId":224132,"journal":{"name":"Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127885292","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"3D Image-based Indoor Localization Joint With WiFi Positioning","authors":"G. Lu, Jingkuan Song","doi":"10.1145/3206025.3206070","DOIUrl":"https://doi.org/10.1145/3206025.3206070","url":null,"abstract":"We realize a system that utilizes WiFi to facilitate the image-based localization system, which avoids the confusion caused by the similar decoration inside the buildings. While WiFi-based localization thread obtains the rough location information, the image-based localization thread retrieves the best matching images and clusters the camera poses associated with the images into different location candidates. The image cluster closest to the WiFi localization outcome is selected for the exact camera pose estimation. The usage of WiFi significantly reduces the search scope, avoiding the extensive search of millions of descriptors in a 3D model. In the image-based localization stage, we also propose a novel 2D-to-2D-to-3D localization framework which follows a coarse-to-fine strategy to quickly locate the query image in several location candidates and performs the local feature matching and camera pose estimation after choosing the correct image location by WiFi positioning. The entire system demonstrates significant benefits in combining both images and WiFi signals in localization tasks and great potential to be deployed in real applications.","PeriodicalId":224132,"journal":{"name":"Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128067482","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Binary Coding by Matrix Classifier for Efficient Subspace Retrieval","authors":"Lei Zhou, Xiao Bai, Xianglong Liu, Jun Zhou","doi":"10.1145/3206025.3206058","DOIUrl":"https://doi.org/10.1145/3206025.3206058","url":null,"abstract":"Fast retrieval in large-scale database with high-dimensional subspaces is an important task in many applications, such as image retrieval, video retrieval and visual recognition. This can be facilitated by approximate nearest subspace (ANS) retrieval which requires effective subspace representation. Most of the existing methods for this problem represent subspace by point in the Euclidean space or the Grassmannian space before applying the approximate nearest neighbor (ANN) search. However, the efficiency of these methods can not be guaranteed because the subspace representation step can be very time consuming when coping with high dimensional data. Moreover, the transforming process for subspace to point will cause subspace structural information loss which influence the retrieval accuracy. In this paper, we present a new approach for hashing-based ANS retrieval. The proposed method learns the binary codes for given subspace set following a similarity preserving criterion. It simultaneously leverages the learned binary codes to train matrix classifiers as hash functions. This method can directly binarize a subspace without transforming it into a vector. Therefore, it can efficiently solve the large-scale and high-dimensional multimedia data retrieval problem. Experiments on face recognition and video retrieval show that our method outperforms several state-of-the-art methods in both efficiency and accuracy.","PeriodicalId":224132,"journal":{"name":"Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125696546","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}