{"title":"Learning to Select Elements for Graphic Design","authors":"Guolong Wang, Zheng Qin, Junchi Yan, Liu Jiang","doi":"10.1145/3372278.3390678","DOIUrl":"https://doi.org/10.1145/3372278.3390678","url":null,"abstract":"Selecting elements for graphic design is essential for ensuring a correct understanding of clients' requirements as well as improving the efficiency of designers before a fine-designed process. Some semi-automatic design tools proposed layout templates where designers always select elements according to the rectangular boxes that specify how elements are placed. In practice, layout and element selection are complementary. Compared to the layout which can be readily obtained from pre-designed templates, it is generally time-consuming to mindfully pick out suitable elements, which calls for an automation of elements selection. To address this, we formulate element selection as a sequential decision-making process and develop a deep element selection network (DESN). Given a layout file with annotated elements, new graphical elements are selected to form graphic designs based on aesthetics and consistency criteria. To train our DESN, we propose an end-to-end, reinforcement learning based framework, where we design a novel reward function that jointly accounts for visual aesthetics and consistency. Based on this, visually readable and aesthetic drafts can be efficiently generated. We further contribute a layout-poster dataset with exhaustively labeled attributes of poster key elements. Qualitative and quantitative results indicate the efficacy of our approach.","PeriodicalId":158014,"journal":{"name":"Proceedings of the 2020 International Conference on Multimedia Retrieval","volume":"53 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115084197","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient Base Class Selection Algorithms for Few-Shot Classification","authors":"Takumi Ohkuma, Hideki Nakayama","doi":"10.1145/3372278.3390724","DOIUrl":"https://doi.org/10.1145/3372278.3390724","url":null,"abstract":"Few-shot classification is a task to learn a classifier for novel classes with a limited number of examples on top of the known base classes which have a sufficient number of examples. In recent years, significant progress has been achieved on this task. However, despite the importance of selecting the base classes themselves for better knowledge transfer, few works have paid attention to this point. In this paper, we propose two types of base class selection algorithms that are suitable for few-shot classification tasks. One is based on the thesaurus-tree structure of class names, and the other is based on word embeddings. In our experiments using representative few-shot learning methods on the ILSVRC dataset, we show that these two algorithms can significantly improve the performance compared to a naive class selection method. Moreover, they do not require high computational and memory costs, which is an important advantage to scale to a very large number of base classes.","PeriodicalId":158014,"journal":{"name":"Proceedings of the 2020 International Conference on Multimedia Retrieval","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129073946","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Actor-Critic Sequence Generation for Relative Difference Captioning","authors":"Z. Fei","doi":"10.1145/3372278.3390679","DOIUrl":"https://doi.org/10.1145/3372278.3390679","url":null,"abstract":"This paper investigates a new task named relative difference caption which aims to generate a sentence to tell the difference between the given image pair. Difference description is a crucial task for developing intelligent machines that can understand and handle changeable visual scenes and applications. Towards that end, we propose a reinforcement learning-based model, which utilizes a policy network and a value network in a decision procedure to collaboratively produce a difference caption. Specifically, the policy network works as an actor to estimate the probability of next word based on the current state and the value network serves as a critic to predict all possible extension values according to current action and state. To encourage generating correct and meaningful descriptions, we leverage a visual-linguistic similarity-based reward function as feedback. Empirical results on the recently released dataset demonstrate the effectiveness of our method in comparison with various baselines and model variants.","PeriodicalId":158014,"journal":{"name":"Proceedings of the 2020 International Conference on Multimedia Retrieval","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114145158","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Detecting, Classifying, and Mapping Retail Storefronts Using Street-level Imagery","authors":"Shahin Sharifi Noorian, S. Qiu, A. Psyllidis, A. Bozzon, G. Houben","doi":"10.1145/3372278.3390706","DOIUrl":"https://doi.org/10.1145/3372278.3390706","url":null,"abstract":"Up-to-date listings of retail stores and related building functions are challenging and costly to maintain. We introduce a novel method for automatically detecting, geo-locating, and classifying retail stores and related commercial functions, on the basis of storefronts extracted from street-level imagery. Specifically, we present a deep learning approach that takes storefronts from street-level imagery as input, and directly provides the geo-location and type of commercial function as output. Our method showed a recall of 89.05% and a precision of 88.22% on a real-world dataset of street-level images, which experimentally demonstrated that our approach achieves human-level accuracy while having a remarkable run-time efficiency compared to methods such as Faster Region-Convolutional Neural Networks (Faster R-CNN) and Single Shot Detector (SSD).","PeriodicalId":158014,"journal":{"name":"Proceedings of the 2020 International Conference on Multimedia Retrieval","volume":"107 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115517415","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-Attention Multimodal Sentiment Analysis","authors":"Taeyong Kim, Bowon Lee","doi":"10.1145/3372278.3390698","DOIUrl":"https://doi.org/10.1145/3372278.3390698","url":null,"abstract":"Sentiment analysis plays an important role in natural-language processing. It has been performed on multimodal data including text, audio, and video. Previously conducted research does not make full utilization of such heterogeneous data. In this study, we propose a model of Multi-Attention Recurrent Neural Network (MA-RNN) for performing sentiment analysis on multimodal data. The proposed network consists of two attention layers and a Bidirectional Gated Recurrent Neural Network (BiGRU). The first attention layer is used for data fusion and dimensionality reduction, and the second attention layer is used for the augmentation of BiGRU to capture key parts of the contextual information among utterances. Experiments on multimodal sentiment analysis indicate that our proposed model achieves the state-of-the-art performance of 84.31% accuracy on the Multimodal Corpus of Sentiment Intensity and Subjectivity Analysis (CMU-MOSI) dataset. Furthermore, an ablation study is conducted to evaluate the contributions of different components of the network. We believe that our findings of this study may also offer helpful insights into the design of models using multimodal data.","PeriodicalId":158014,"journal":{"name":"Proceedings of the 2020 International Conference on Multimedia Retrieval","volume":"138 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132514895","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Deep Discrete Attention Guided Hashing for Face Image Retrieval","authors":"Zhi Xiong, Dayan Wu, Wen Gu, Haisu Zhang, Bo Li, Weiping Wang","doi":"10.1145/3372278.3390683","DOIUrl":"https://doi.org/10.1145/3372278.3390683","url":null,"abstract":"Recently, face image hashing has been proposed in large-scale face image retrieval due to its storage and computational efficiency. However, owing to the large intra-identity variation (same identity with different poses, illuminations, and facial expressions) and the small inter-identity separability (different identities look similar) of face images, existing face image hashing methods have limited power to generate discriminative hash codes. In this work, we propose a deep hashing method specially designed for face image retrieval named deep Discrete Attention Guided Hashing (DAGH). In DAGH, the discriminative power of hash codes is enhanced by a well-designed discrete identity loss, where not only the separability of the learned hash codes for different identities is encouraged, but also the intra-identity variation of the hash codes for the same identities is compacted. Besides, to obtain the fine-grained face features, DAGH employs a multi-attention cascade network structure to highlight discriminative face features. Moreover, we introduce a discrete hash layer into the network, along with the proposed modified backpropagation algorithm, our model can be optimized under discrete constraint. Experiments on two widely used face image retrieval datasets demonstrate the inspiring performance of DAGH over the state-of-the-art face image hashing methods.","PeriodicalId":158014,"journal":{"name":"Proceedings of the 2020 International Conference on Multimedia Retrieval","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127227622","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Rank-embedded Hashing for Large-scale Image Retrieval","authors":"Haiyan Fu, Ying Li, Hengheng Zhang, Jinfeng Liu, Tao Yao","doi":"10.1145/3372278.3390716","DOIUrl":"https://doi.org/10.1145/3372278.3390716","url":null,"abstract":"With the growth of images on the Internet, plenty of hashing methods are developed to handle the large-scale image retrieval task. Hashing methods map data from high dimension to compact codes, so that they can effectively cope with complicated image features. However, the quantization process of hashing results in unescapable information loss. As a consequence, it is a challenge to measure the similarity between images with generated binary codes. The latest works usually focus on learning deep features and hashing functions simultaneously to preserve the similarity between images, while the similarity metric is fixed. In this paper, we propose a Rank-embedded Hashing (ReHash) algorithm where the ranking list is automatically optimized together with the feedback of the supervised hashing. Specifically, ReHash jointly trains the metric learning and the hashing codes in an end-to-end model. In this way, the similarity between images are enhanced by the ranking process. Meanwhile, the ranking results are an additional supervision for the hashing function learning as well. Extensive experiments show that our ReHash outperforms the state-of-the-art hashing methods for large-scale image retrieval.","PeriodicalId":158014,"journal":{"name":"Proceedings of the 2020 International Conference on Multimedia Retrieval","volume":"89 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126492928","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Knowledge Enhanced Neural Fashion Trend Forecasting","authors":"Yunshan Ma, Yujuan Ding, Xun Yang, Lizi Liao, Wai Keung Wong, Tat-Seng Chua","doi":"10.1145/3372278.3390677","DOIUrl":"https://doi.org/10.1145/3372278.3390677","url":null,"abstract":"Fashion trend forecasting is a crucial task for both academia andindustry. Although some efforts have been devoted to tackling this challenging task, they only studied limited fashion elements with highly seasonal or simple patterns, which could hardly reveal thereal fashion trends. Towards insightful fashion trend forecasting,this work focuses on investigating fine-grained fashion element trends for specific user groups. We first contribute a large-scale fashion trend dataset (FIT) collected from Instagram with extracted time series fashion element records and user information. Furthermore, to effectively model the time series data of fashion elements with rather complex patterns, we propose a Knowledge Enhanced Recurrent Network model (KERN) which takes advantage of the capability of deep recurrent neural networks in modeling time series data. Moreover, it leverages internal and external knowledgein fashion domain that affects the time-series patterns of fashion element trends. Such incorporation of domain knowledge further enhances the deep learning model in capturing the patterns of specific fashion elements and predicting the future trends. Extensive experiments demonstrate that the proposed KERN model can effectively capture the complicated patterns of objective fashion elements, therefore making preferable fashion trend forecast.","PeriodicalId":158014,"journal":{"name":"Proceedings of the 2020 International Conference on Multimedia Retrieval","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129859867","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HLVU: A New Challenge to Test Deep Understanding of Movies the Way Humans do","authors":"Keith Curtis, G. Awad, Shahzad Rajput, I. Soboroff","doi":"10.1145/3372278.3390742","DOIUrl":"https://doi.org/10.1145/3372278.3390742","url":null,"abstract":"In this paper we propose a new evaluation challenge and direction in the area of High-level Video Understanding. The challenge we are proposing is designed to test automatic video analysis and understanding, and how accurately systems can comprehend a movie in terms of actors, entities, events and their relationship to each other. A pilot High-Level Video Understanding (HLVU) dataset of open source movies were collected for human assessors to build a knowledge graph representing each of them. A set of queries will be derived from the knowledge graph to test systems on retrieving relationships among actors, as well as reasoning and retrieving non-visual concepts. The objective is to benchmark if a computer system can \"understand\" non-explicit but obvious relationships the same way humans do when they watch the same movies. This is long-standing problem that is being addressed in the text domain and this project moves similar research to the video domain. Work of this nature is foundational to future video analytics and video understanding technologies. This work can be of interest to streaming services and broadcasters hoping to provide more intuitive ways for their customers to interact with and consume video content.","PeriodicalId":158014,"journal":{"name":"Proceedings of the 2020 International Conference on Multimedia Retrieval","volume":"2021 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132154827","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Image Retrieval using Multi-scale CNN Features Pooling","authors":"Federico Vaccaro, M. Bertini, Tiberio Uricchio, A. Bimbo","doi":"10.1145/3372278.3390732","DOIUrl":"https://doi.org/10.1145/3372278.3390732","url":null,"abstract":"In this paper, we address the problem of image retrieval by learning images representation based on the activations of a Convolutional Neural Network. We present an end-to-end trainable network architecture that exploits a novel multi-scale local pooling based on NetVLAD and a triplet mining procedure based on samples difficulty to obtain an effective image representation. Extensive experiments show that our approach is able to reach state-of-the-art results on three standard datasets.","PeriodicalId":158014,"journal":{"name":"Proceedings of the 2020 International Conference on Multimedia Retrieval","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-04-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123470107","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}