{"title":"Video-to-Video Translation with Global Temporal Consistency","authors":"Xingxing Wei, Jun Zhu, Sitong Feng, Hang Su","doi":"10.1145/3240508.3240708","DOIUrl":"https://doi.org/10.1145/3240508.3240708","url":null,"abstract":"Although image-to-image translation has been widely studied, the video-to-video translation is rarely mentioned. In this paper, we propose an unified video-to-video translation framework to accom- plish different tasks, like video super-resolution, video colouriza- tion, and video segmentation, etc. A consequent question within video-to-video translation lies in the flickering appearance along with the varying frames. To overcome this issue, a usual method is to incorporate the temporal loss between adjacent frames in the optimization, which is a kind of local frame-wise temporal con- sistency. We instead present a residual error based mechanism to ensure the video-level consistency of the same location in different frames (called (lobal temporal consistency). The global and local consistency are simultaneously integrated into our video-to-video framework to achieve more stable videos. Our method is based on the GAN framework, where we present a two-channel discrimina- tor. One channel is to encode the video RGB space, and another is to encode the residual error of the video as a whole to meet the global consistency. Extensive experiments conducted on different video- to-video translation tasks verify the effectiveness and flexibleness of the proposed method.","PeriodicalId":339857,"journal":{"name":"Proceedings of the 26th ACM international conference on Multimedia","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115297850","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Video Forecasting with Forward-Backward-Net: Delving Deeper into Spatiotemporal Consistency","authors":"Yuke Li","doi":"10.1145/3240508.3240551","DOIUrl":"https://doi.org/10.1145/3240508.3240551","url":null,"abstract":"Video forecasting is an emerging topic in the computer vision field, and it is a pivotal step toward unsupervised video understanding. However, the predictions generated from the state-of-the-art methods might be far from ideal quality, due to a lack of guidance from the labeled data of correct predictions (e.g., the annotated future pose of a person). Hence, building a network for better predicting future sequences in an unsupervised manner has to be further pursued. To this end, we put forth a novel Forward-Backward-Net (FB-Net) architecture, which delves deeper into spatiotemporal consistency. It first derives the forward consistency from the raw historical observations. In contrast to mainstream video forecasting approaches, FB-Net then investigates the backward consistency from the future to the past to reinforce the predictions. The final predicted results are inferred by jointly taking both the forward and backward consistencies into account. Moreover, we embed the motion dynamics and the visual content into a single framework via the FB-Net architecture, which significantly differs from learning each component throughout the videos separately. We evaluate our FB-Net on the large-scale KTH and UCF101 datasets. The experiments show that it can introduce considerable margin improvements with respect to most recent leading studies.","PeriodicalId":339857,"journal":{"name":"Proceedings of the 26th ACM international conference on Multimedia","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115720199","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Panel-1","authors":"Jun Jitao, Yu Sang","doi":"10.1145/3286933","DOIUrl":"https://doi.org/10.1145/3286933","url":null,"abstract":"","PeriodicalId":339857,"journal":{"name":"Proceedings of the 26th ACM international conference on Multimedia","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114229590","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Learning Joint Multimodal Representation with Adversarial Attention Networks","authors":"Feiran Huang, Xiaoming Zhang, Zhoujun Li","doi":"10.1145/3240508.3240614","DOIUrl":"https://doi.org/10.1145/3240508.3240614","url":null,"abstract":"Recently, learning a joint representation for the multimodal data (e.g., containing both visual content and text description) has attracted extensive research interests. Usually, the features of different modalities are correlational and compositive, and thus a joint representation capturing the correlation is more effective than a subset of the features. Most of existing multimodal representation learning methods suffer from lack of additional constraints to enhance the robustness of the learned representations. In this paper, a novel Adversarial Attention Networks (AAN) is proposed to incorporate both the attention mechanism and the adversarial networks for effective and robust multimodal representation learning. Specifically, a visual-semantic attention model with siamese learning strategy is proposed to encode the fine-grained correlation between visual and textual modalities. Meanwhile, the adversarial learning model is employed to regularize the generated representation by matching the posterior distribution of the representation to the given priors. Then, the two modules are incorporated into a integrated learning framework to learn the joint multimodal representation. Experimental results in two tasks, i.e., multi-label classification and tag recommendation, show that the proposed model outperforms state-of-the-art representation learning methods.","PeriodicalId":339857,"journal":{"name":"Proceedings of the 26th ACM international conference on Multimedia","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126819615","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Beyond the Product: Discovering Image Posts for Brands in Social Media","authors":"Francesco Gelli, Tiberio Uricchio, Xiangnan He, A. Bimbo, Tat-Seng Chua","doi":"10.1145/3240508.3240689","DOIUrl":"https://doi.org/10.1145/3240508.3240689","url":null,"abstract":"Brands and organizations are using social networks such as Instagram to share image or video posts regularly, in order to engage and maximize their presence to the users. Differently from the traditional advertising paradigm, these posts feature not only specific products, but also the value and philosophy of the brand, known as brand associations in marketing literature. In fact, marketers are spending considerable resources to generate their content in-house, and increasingly often, to discover and repost the content generated by users. However, to choose the right posts for a brand in social media remains an open problem. Driven by this real-life application, we define the new task of content discovery for brands, which aims to discover posts that match the marketing value and brand associations of a target brand. We identify two main challenges in this new task: high inter-brand similarity and brand-post sparsity; and propose a tailored content-based learning-to-rank system to discover content for a target brand. Specifically, our method learns fine-grained brand representation via explicit modeling of brand associations, which can be interpreted as visual words shared among brands. We collected a new large-scale Instagram dataset, consisting of more than 1.1 million image and video posts from the history of 927 brands of fourteen verticals such as food and fashion. Extensive experiments indicate that our model can effectively learn fine-grained brand representations and outperform the closest state-of-the-art solutions.","PeriodicalId":339857,"journal":{"name":"Proceedings of the 26th ACM international conference on Multimedia","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126152852","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"LA-Net: Layout-Aware Dense Network for Monocular Depth Estimation","authors":"Kecheng Zheng, Zhengjun Zha, Yang Cao, X. Chen, Feng Wu","doi":"10.1145/3240508.3240628","DOIUrl":"https://doi.org/10.1145/3240508.3240628","url":null,"abstract":"Depth estimation from monocular images is an ill-posed and inherently ambiguous problem. Recently, deep learning technique has been applied for monocular depth estimation seeking data-driven solutions. However, most existing methods focus on pursuing the minimization of average depth regression error at pixel level and neglect to encode the global layout of scene, resulting in layout-inconsistent depth map. This paper proposes a novel Layout-Aware Convolutional Neural Network (LA-Net) for accurate monocular depth estimation by simultaneously perceiving scene layout and local depth details. Specifically, a Spatial Layout Network (SL-Net) is proposed to learn a layout map representing the depth ordering between local patches. A Layout-Aware Depth Estimation Network (LDE-Net) is proposed to estimate pixel-level depth details using multi-scale layout maps as structural guidance, leading to layout-consistent depth map. A dense network module is used as the base network to learn effective visual details resorting to dense feed-forward connections. Moreover, we formulate an order-sensitive softmax loss to well constrain the ill-posed depth inferring problem. Extensive experiments on both indoor scene (NYUD-v2) and outdoor scene (Make3D) datasets have demonstrated that the proposed LA-Net outperforms the state-of-the-art methods and leads to faithful 3D projections.","PeriodicalId":339857,"journal":{"name":"Proceedings of the 26th ACM international conference on Multimedia","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126930512","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Paragraph Generation Network with Visual Relationship Detection","authors":"Wenbin Che, Xiaopeng Fan, Ruiqin Xiong, Debin Zhao","doi":"10.1145/3240508.3240695","DOIUrl":"https://doi.org/10.1145/3240508.3240695","url":null,"abstract":"Paragraph generation of images is a new concept, aiming to produce multiple sentences to describe a given image. In this paper, we propose a paragraph generation network with introducing visual relationship detection. We first detect regions which may contain important visual objects and then predict their relationships. Paragraphs are produced based on object regions which have valid relationship with others. Compared with previous works which generate sentences based on region features, we explicitly explore and utilize visual relationships in order to improve final captions. The experimental results show that such strategy could improve paragraph generating performance from two aspects: more details about object relations are detected and more accurate sentences are obtained. Furthermore, our model is more robust to region detection fluctuation.","PeriodicalId":339857,"journal":{"name":"Proceedings of the 26th ACM international conference on Multimedia","volume":"144 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123774351","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Interpretable Multimodal Retrieval for Fashion Products","authors":"Lizi Liao, Xiangnan He, Bo Zhao, C. Ngo, Tat-Seng Chua","doi":"10.1145/3240508.3240646","DOIUrl":"https://doi.org/10.1145/3240508.3240646","url":null,"abstract":"Deep learning methods have been successfully applied to fashion retrieval. However, the latent meaning of learned feature vectors hinders the explanation of retrieval results and integration of user feedback. Fortunately, there are many online shopping websites organizing fashion items into hierarchical structures based on product taxonomy and domain knowledge. Such structures help to reveal how human perceive the relatedness among fashion products. Nevertheless, incorporating structural knowledge for deep learning remains a challenging problem. This paper presents techniques for organizing and utilizing the fashion hierarchies in deep learning to facilitate the reasoning of search results and user intent. The novelty of our work originates from the development of an EI (Exclusive & Independent) tree that can cooperate with deep models for end-to-end multimodal learning. EI tree organizes the fashion concepts into multiple semantic levels and augments the tree structure with exclusive as well as independent constraints. It describes the different relationships among sibling concepts and guides the end-to-end learning of multi-level fashion semantics. From EI tree, we learn an explicit hierarchical similarity function to characterize the semantic similarities among fashion products. It facilitates the interpretable retrieval scheme that can integrate the concept-level feedback. Experiment results on two large fashion datasets show that the proposed approach can characterize the semantic similarities among fashion items accurately and capture user's search intent precisely, leading to more accurate search results as compared to the state-of-the-art methods.","PeriodicalId":339857,"journal":{"name":"Proceedings of the 26th ACM international conference on Multimedia","volume":"60 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121612913","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Semi-Supervised DFF: Decoupling Detection and Feature Flow for Video Object Detectors","authors":"Guangxing Han, Xuan Zhang, Chongrong Li","doi":"10.1145/3240508.3240693","DOIUrl":"https://doi.org/10.1145/3240508.3240693","url":null,"abstract":"For efficient video object detection, our detector consists of a spatial module and a temporal module. The spatial module aims to detect objects in static frames using convolutional networks, and the temporal module propagates high-level CNN features to nearby frames via light-weight feature flow. Alternating the spatial and temporal module by a proper interval makes our detector fast and accurate. Then we propose a two-stage semi-supervised learning framework to train our detector, which fully exploits unlabeled videos by decoupling the spatial and temporal module. In the first stage, the spatial module is learned by traditional supervised learning. In the second stage, we employ both feature regression loss and feature semantic loss to learn our temporal module via unsupervised learning. Different to traditional methods, our method can largely exploit unlabeled videos and bridges the gap of object detectors in image and video domain. Experiments on the large-scale ImageNet VID dataset demonstrate the effectiveness of our method. Code will be made publicly available.","PeriodicalId":339857,"journal":{"name":"Proceedings of the 26th ACM international conference on Multimedia","volume":"27 3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131049557","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Challenges and Practices of Large Scale Visual Intelligence in the Real-World","authors":"Xiansheng Hua","doi":"10.1145/3240508.3267342","DOIUrl":"https://doi.org/10.1145/3240508.3267342","url":null,"abstract":"Visual intelligence is one of the key aspects of Artificial Intelligence. Considerable technology progresses along this direction have been made in the past a few years. However, how to incubate the right technologies and convert them into real business values in the real-world remains a challenge. In this talk, we will analyze current challenges of visual intelligence in the real-world and try to summarize a few key points that help us successfully develop and apply technologies to solve real-world problems. In particular, we will introduce a few successful examples, including \"City Brain\", \"Luban (visual design)\", from the problem definition/discovery, to technology development, to product design, and to realizing business values. City Brain: A city is an aggregate of a huge amount of heterogeneous data. However, extracting meaningful values from that data is nontrivial. City Brain is an end-to-end system whose goal is to glean irreplaceable values from big-city data, specifically videos, with the assistance of rapidly evolving AI technologies and fast-growing computing capacity. From cognition to optimization, to decision-making, from search to prediction and ultimately, to intervention, City Brain improves the way we manage the city, as well as the way we live in it. In this talk, we will introduce current practices of the City Brain platform, as well as what we can do to achieve the goal and make it a reality, step by step. Luban: Different from most typical visual intelligence technologies, which are more focused on analyzing, recognizing or searching visual objects, the goal of Luban (visual design) is to create visual content. In particular, we will introduce an automatic 2D banner design technique that is based on deep learning and reinforcement learning. We will detail how Luban was created and how it changed the world of 2D banner design by creating 50M banners a day.","PeriodicalId":339857,"journal":{"name":"Proceedings of the 26th ACM international conference on Multimedia","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133875108","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}