{"title":"A Unified Framework for Multimodal Domain Adaptation","authors":"Fan Qi, Xiaoshan Yang, Changsheng Xu","doi":"10.1145/3240508.3240633","DOIUrl":"https://doi.org/10.1145/3240508.3240633","url":null,"abstract":"Domain adaptation aims to train a model on labeled data from a source domain while minimizing test error on a target domain. Most existing domain adaptation methods focus only on reducing the domain shift of single-modal data. In this paper, we consider a new problem of multimodal domain adaptation and propose a unified framework to solve it. The proposed multimodal domain adaptation neural networks (MDANN) consist of three important modules. (1) A covariant multimodal attention is designed to learn a common feature representation for multiple modalities. (2) A fusion module adaptively fuses attended features of different modalities. (3) Hybrid domain constraints are proposed to comprehensively learn domain-invariant features by constraining single-modal features, fused features, and attention scores. Through jointly attending and fusing under an adversarial objective, the most discriminative and domain-adaptive parts of the features are adaptively fused together. Extensive experimental results on two real-world cross-domain applications (emotion recognition and cross-media retrieval) demonstrate the effectiveness of the proposed method.","PeriodicalId":339857,"journal":{"name":"Proceedings of the 26th ACM international conference on Multimedia","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132941728","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: FF-6","authors":"B. Huet","doi":"10.1145/3286943","DOIUrl":"https://doi.org/10.1145/3286943","url":null,"abstract":"","PeriodicalId":339857,"journal":{"name":"Proceedings of the 26th ACM international conference on Multimedia","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133034224","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"End2End Semantic Segmentation for 3D Indoor Scenes","authors":"Na Zhao","doi":"10.1145/3240508.3243933","DOIUrl":"https://doi.org/10.1145/3240508.3243933","url":null,"abstract":"This research is concerned with semantic segmentation of 3D point clouds arising from videos of 3D indoor scenes. It is an important building block of 3D scene understanding and has promising applications such as augmented reality and robotics. Although various deep learning based approaches have been proposed to replicate the success of 2D semantic segmentation in the 3D domain, they either result in severe information loss or fail to model the geometric structures well. In this paper, we aim to model the local and global geometric structures of 3D scenes by designing an end-to-end 3D semantic segmentation framework. It captures local geometries through point-level feature learning and voxel-level aggregation, models global structures via a 3D CNN, and enforces label consistency with a high-order CRF. Through preliminary experiments conducted on two indoor datasets, we describe our insights on the proposed approach, and present some directions to be pursued in the future.","PeriodicalId":339857,"journal":{"name":"Proceedings of the 26th ACM international conference on Multimedia","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129840592","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SoMin.ai: Social Multimedia Influencer Discovery Marketplace","authors":"Aleksandr Farseev, Kirill Lepikhin, H. Schwartz, Eu Khoon Ang, K. Powar","doi":"10.1145/3240508.3241387","DOIUrl":"https://doi.org/10.1145/3240508.3241387","url":null,"abstract":"In this technical demonstration, we showcase the first AI-driven social multimedia influencer discovery marketplace, called SoMin. The platform combines advanced data analytics and behavioral science to help marketers find and understand their audience and engage the most relevant social media micro-influencers at large scale. SoMin harvests brand-specific live social multimedia streams in a specified market domain, followed by rich analytics and semantic-based influencer search. The Individual User Profiling models extrapolate the key personal characteristics of the brand audience, while the influencer retrieval engine reveals the semantically-matching social media influencers to the platform users. The influencers are matched in terms of both their posted content and social media audiences, and the evaluation results demonstrate excellent performance of the proposed recommender framework. By leveraging influencers at a large scale, marketers will be able to execute more effective marketing campaigns with higher trust at a lower cost.","PeriodicalId":339857,"journal":{"name":"Proceedings of the 26th ACM international conference on Multimedia","volume":"95 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132789930","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Facial Expression Recognition Enhanced by Thermal Images through Adversarial Learning","authors":"Bowen Pan, Shangfei Wang","doi":"10.1145/3240508.3240608","DOIUrl":"https://doi.org/10.1145/3240508.3240608","url":null,"abstract":"Currently, fusing visible and thermal images for facial expression recognition requires both modalities during both training and testing. Visible cameras are commonly used in real-life applications, whereas thermal cameras are typically only available in lab settings due to their high price; thermal imaging for facial expression recognition is therefore rarely used in real-world situations. To address this, we propose a novel thermally enhanced facial expression recognition method which uses thermal images as privileged information to construct better visible feature representations and improved classifiers by incorporating adversarial learning and similarity constraints during training. Specifically, we train two deep neural networks from visible images and thermal images. We impose an adversarial loss to enforce statistical similarity between the learned representations of the two modalities, and a similarity constraint to regulate the mapping functions from visible and thermal representations to expressions. Thus, thermal images are leveraged to simultaneously improve visible feature representation and classification during training. To mimic real-world scenarios, only visible images are available during testing. We further extend the proposed expression recognition method to partially unpaired data to explore thermal images' supplementary role in visible facial expression recognition when visible and thermal images are not synchronously recorded. Experimental results on the MAHNOB Laughter database demonstrate that our proposed method can effectively regularize visible representations and expression classifiers with the help of thermal images, achieving state-of-the-art recognition performance.","PeriodicalId":339857,"journal":{"name":"Proceedings of the 26th ACM international conference on Multimedia","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132903195","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FoV-Aware Edge Caching for Adaptive 360° Video Streaming","authors":"A. Mahzari, A. T. Nasrabadi, Aliehsan Samiei, R. Prakash","doi":"10.1145/3240508.3240680","DOIUrl":"https://doi.org/10.1145/3240508.3240680","url":null,"abstract":"In recent years, Virtual Reality (VR) has grown in popularity, enabled by technologies like 360° video streaming. Streaming 360° video is extremely challenging due to high bandwidth and low latency requirements. Some VR solutions employ adaptive 360° video streaming, which tries to reduce bandwidth consumption by streaming high resolution video only for the user's Field of View (FoV), the part of the video being viewed by the user at any given time. Although FoV-adaptive 360° video streaming helps reduce bandwidth requirements, streaming 360° video from distant content servers remains challenging due to network latency. Caching popular content close to end users not only decreases network latency, but also alleviates network bandwidth demands by reducing the number of future requests that must be sent all the way to remote content servers. In this paper, we propose a novel caching policy based on users' FoV, called the FoV-aware caching policy, in which we learn a probabilistic model of the common FoV for each 360° video from previous users' viewing histories to improve caching performance. Through experiments with a dataset of real users' head movements, we show that our proposed approach improves cache hit ratio over the Least Frequently Used (LFU) and Least Recently Used (LRU) caching policies by at least 40% and 17%, respectively.","PeriodicalId":339857,"journal":{"name":"Proceedings of the 26th ACM international conference on Multimedia","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130246256","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Pseudo Transfer with Marginalized Corrupted Attribute for Zero-shot Learning","authors":"Teng Long, Xing Xu, Youyou Li, Fumin Shen, Jingkuan Song, Heng Tao Shen","doi":"10.1145/3240508.3240715","DOIUrl":"https://doi.org/10.1145/3240508.3240715","url":null,"abstract":"Zero-shot learning (ZSL) aims to recognize unseen classes that are excluded from the training classes. ZSL suffers from 1) Zero-shot bias (Z-Bias) --- the model is biased towards seen classes because unseen data is inaccessible for training; 2) Zero-shot variance (Z-Variance) --- associating different images with the same semantic embedding yields a large association error. To reduce Z-Bias, we propose a pseudo transfer mechanism, where we first synthesize the distribution of unseen data using semantic embeddings, then minimize the mismatch between the seen distribution and the synthesized unseen distribution. To reduce Z-Variance, we implicitly corrupt each semantic embedding multiple times to generate image-wise semantic vectors, with which our model learns robust classifiers. Lastly, we integrate our Z-Bias and Z-Variance reduction techniques with a linear ZSL model to show their usefulness. Our proposed model successfully overcomes the Z-Bias and Z-Variance problems. Extensive experiments on five benchmark datasets including ImageNet-1K demonstrate that our model outperforms state-of-the-art methods with fast training.","PeriodicalId":339857,"journal":{"name":"Proceedings of the 26th ACM international conference on Multimedia","volume":"249 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114158632","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Vision-2 (Object & Scene Understanding)","authors":"Zhengjun Zha","doi":"10.1145/3286922","DOIUrl":"https://doi.org/10.1145/3286922","url":null,"abstract":"","PeriodicalId":339857,"journal":{"name":"Proceedings of the 26th ACM international conference on Multimedia","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114877056","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Robust Correlation Filter Tracking with Shepherded Instance-Aware Proposals","authors":"Yanjie Liang, Qiangqiang Wu, Yi Liu, Y. Yan, Hanzi Wang","doi":"10.1145/3240508.3240709","DOIUrl":"https://doi.org/10.1145/3240508.3240709","url":null,"abstract":"In recent years, convolutional neural network (CNN) based correlation filter trackers have achieved state-of-the-art results on the benchmark datasets. However, CNN-based correlation filters cannot effectively handle large scale variation and distortion (such as fast motion, background clutter, and occlusion), leading to sub-optimal performance. In this paper, we propose a novel CNN-based correlation filter tracker with shepherded instance-aware proposals, namely DeepCFIAP, which automatically estimates the target scale in each frame and re-detects the target when distortion occurs. DeepCFIAP is designed to take advantage of the merits of both instance-aware proposals and CNN-based correlation filters. Compared with CNN-based correlation filter trackers, DeepCFIAP can successfully solve the problems of large scale variation and distortion via the shepherded instance-aware proposals, resulting in more robust tracking performance. Specifically, we develop a novel proposal ranking algorithm based on the similarities between proposals and instances. In contrast to detection proposal based trackers, DeepCFIAP shepherds the instance-aware proposals towards their optimal positions via the CNN-based correlation filters, resulting in more accurate tracking results. Extensive experiments on two challenging benchmark datasets demonstrate that the proposed DeepCFIAP performs favorably against state-of-the-art trackers and is especially well-suited to long-term tracking.","PeriodicalId":339857,"journal":{"name":"Proceedings of the 26th ACM international conference on Multimedia","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116485528","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ChildAR-bot: Educational Playing Projection-based AR Robot for Children","authors":"Yoonjung Park, Yoonsik Yang, Hyocheol Ro, Jinwon Cha, Kyuri Kim, T. Han","doi":"10.1145/3240508.3241362","DOIUrl":"https://doi.org/10.1145/3240508.3241362","url":null,"abstract":"Children encounter a variety of experiences through play, which can improve their ability to form ideas and undergo multi-faceted development. Using Augmented Reality (AR) technology to integrate various digital learning elements with real environments can lead to increased learning ability. This study proposes a 360° rotatable and portable system specialized for education and development through projection-based AR play. This system allows existing projection-based AR technology, which once could only be experienced at large-scale exhibitions and experience centers, to be used in individual and small-scale spaces. It also promotes the development of multi-sensory abilities through multi-modality, providing various intuitive and sensory interactions. By experiencing the various educational play applications provided by the proposed system, children can increase their physical, perceptual, and emotional abilities and thinking skills.","PeriodicalId":339857,"journal":{"name":"Proceedings of the 26th ACM international conference on Multimedia","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116747928","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}