{"title":"Multi-task Deep Neural Network for Joint Face Recognition and Facial Attribute Prediction","authors":"Zhanxiong Wang, Keke He, Yanwei Fu, Rui Feng, Yu-Gang Jiang, X. Xue","doi":"10.1145/3078971.3078973","DOIUrl":"https://doi.org/10.1145/3078971.3078973","url":null,"abstract":"Deep neural networks have significantly improved the performance of face recognition and facial attribute prediction, which however are still very challenging on the million scale dataset, i.e. MegaFace. In this paper, we for the first time, advocate a multi-task deep neural network for jointly learning face recognition and facial attribute prediction tasks. Extensive experimental evaluation clearly demonstrates the effectiveness of our architecture. Remarkably, on the largest face recognition benchmark -- MegaFace dataset, our networks can achieve the Rank-1 identication accuracy of 77.74% and face verication accuracy 79.24% TAR at 10-6 FAR, which are the best performance on the small protocol among all the publicly released methods.","PeriodicalId":403556,"journal":{"name":"Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval","volume":"94 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124756063","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"With 5G Approaching, How will Audio/Video Technology that Serves 800 Million QQ Users Bring Forth New Ideas","authors":"Xiaozheng Huang","doi":"10.1145/3078971.3081369","DOIUrl":"https://doi.org/10.1145/3078971.3081369","url":null,"abstract":"Back to 1999, a popular IM QQ in China, stilled called OICQ at that time, released a new version, which included the functionality of audio call for the first time. Not much time later, video call was also enabled. After 18 years of fast growing, QQ has 800 million monthly active users.QQ users spend 1.2 billion minutes for audio and video call every single day. With QQ's fast growing, the audio and video technology behind it also evolves tremendously. We build our own audio/video technology center, which grows to Tencent Audio/Video Lab, and develops our own SDK when OEM cannot meet our needs. The new generation of audio/video communication engine \"SPEAR\", developed by our own, serves 800 million QQ users today. Our web broadcasting solution serves China's 10 top web broadcasting platforms, with 200 million user base and 70% market share of China. With 5G approaching, how will audio/video technology that serves 800 million QQ users bring forth new ideas? In this presentation, I will firstly introduce how the audio/video technology develops in Tencent Audio/Video Lab while internet transferring from PC to mobile. Secondly, I will explain the capability of our technology in the field of audio/video web communication, web broadcasting and image/audio/video processing. Thirdly, I will present our new research results and how they are used in our products and services. Then, I will talk a little about our future plan.","PeriodicalId":403556,"journal":{"name":"Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval","volume":"227 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126131772","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Geo-Privacy Bonus of Popular Photo Enhancements","authors":"Jaeyoung Choi, M. Larson, Xinchao Li, Kevin Li, G. Friedland, A. Hanjalic","doi":"10.1145/3078971.3080543","DOIUrl":"https://doi.org/10.1145/3078971.3080543","url":null,"abstract":"Today's geo-location estimation approaches are able to infer the location of a target image using its visual content alone. These approaches typically exploit visual matching techniques, applied to a large collection of background images with known geo-locations. Users who are unaware that visual analysis and retrieval approaches can compromise their geo-privacy, unwittingly open themselves to risks of crime or other unintended consequences. This paper lays the groundwork for a new approach to geo-privacy of social images: Instead of requiring a change of user behavior, we start by investigating users' existing photo-sharing practices. We carry out a series of experiments using a large collection of social images (8.5M) to systematically analyze how photo editing practices impact the performance of geo-location estimation. We find that standard image enhancements, including filters and cropping, already serve as natural geo-privacy protectors. In our experiments, up to 19% of images whose location would otherwise be automatically predictable were unlocalizeable after enhancement. We conclude that it would be wrong to assume that geo-visual privacy is a lost cause in today's world of rapidly maturing machine learning. Instead, protecting users against the unwanted effects of pixel-based inference is a viable research field. A starting point is understanding the geo-privacy bonus of already established user behavior.","PeriodicalId":403556,"journal":{"name":"Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129067227","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Discrete Multi-view Hashing for Effective Image Retrieval","authors":"Rui Yang, Yuliang Shi, Xin-Shun Xu","doi":"10.1145/3078971.3078981","DOIUrl":"https://doi.org/10.1145/3078971.3078981","url":null,"abstract":"Recently, hashing techniques have witnessed an increase in popularity due to their low storage cost and high query speed for large scale data retrieval task, e.g., image retrieval. Many methods have been proposed; however, most existing hashing techniques focus on single view data. In many scenarios, there are multiple views in data samples. Thus, those methods working on single view can not make full use of rich information contained in multi-view data. Although some methods have been proposed for multi-view data; they usually relax binary constraints or separate the process of learning hash functions and binary codes into two independent stages to bypass the obstacle of handling the discrete constraints on binary codes for optimization, which may generate large quantization error. To consider these problems, in this paper, we propose a novel hashing method, i.e., Discrete Multi-view Hashing (DMVH), which can work on multi-view data directly and make full use of rich information in multi-view data. Moreover, in DMVH, we optimize discrete codes directly instead of relaxing the binary constraints so that we could obtain high-quality hash codes. Simultaneously, we present a novel approach to construct similarity matrix, which can not only preserve local similarity structure, but also keep semantic similarity between data points. To solve the optimization problem in DMVH, we further propose an alternate algorithm. We test the proposed model on three large scale data sets. Experimental results show that it outperforms or is comparable to several state-of-the-arts.","PeriodicalId":403556,"journal":{"name":"Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128969147","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Linking Multimedia Content for Efficient News Browsing","authors":"R. Bois, G. Gravier, Eric Jamet, E. Morin, Maxime Robert, P. Sébillot","doi":"10.1145/3078971.3079023","DOIUrl":"https://doi.org/10.1145/3078971.3079023","url":null,"abstract":"As the amount of news information available online grows, media are in need of advanced tools to explore the information surrounding specific events before writing their own piece of news, e.g., adding context and insight. While many tools exist to extract information from large datasets, they do not offer an easy way to gain insight from a news collection by browsing, going from article to article and viewing unaltered original content. Such browsing tools require the creation of rich underlying structures such as graph representations. These representations can be further enhanced by typing links that connect nodes, in order to inform the user on the nature of their relation. In this article, we introduce an efficient way to generate links between news items in order to obtain an easily navigable graph, and enrich this graph by automatically typing created links. User evaluations are conducted on real world data in order to assess for the interest of both the graph representation and link typing in a press reviewing task, showing a significant improvement compared to classical search engines.","PeriodicalId":403556,"journal":{"name":"Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124198508","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Oral Session 1: Vision and Language (Oral Presentations)","authors":"H. Cucu","doi":"10.1145/3254615","DOIUrl":"https://doi.org/10.1145/3254615","url":null,"abstract":"","PeriodicalId":403556,"journal":{"name":"Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125695521","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On the Automatic Identification of Music for Common Activities","authors":"Karthik Yadati, Cynthia C. S. Liem, M. Larson, A. Hanjalic","doi":"10.1145/3078971.3078997","DOIUrl":"https://doi.org/10.1145/3078971.3078997","url":null,"abstract":"In this paper, we address the challenge of identifying music suitable to accompany typical daily activities. We first derive a list of common activities by analyzing social media data. Then, an automatic approach is proposed to find music for these activities. Our approach is inspired by our experimentally acquired findings (a) that genre and instrument information, i.e., as appearing in the textual metadata, are not sufficient to distinguish music appropriate for different types of activities, and (b) that existing content-based approaches in the music information retrieval community do not overcome this insufficiency. The main contributions of our work are (a) our analysis of the properties of activity-related music that inspire our use of novel high-level features, e.g., drop-like events, and (b) our approach's novel method of extracting and combining low-level features, and, in particular, the joint optimization of the time window for feature aggregation and the number of features to be used. The effectiveness of the approach method is demonstrated in a comprehensive experimental study including failure analysis.","PeriodicalId":403556,"journal":{"name":"Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131180173","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Joint Saliency Estimation and Matching using Image Regions for Geo-Localization of Online Video","authors":"Freda Shi, Jia Chen, Alexander Hauptmann","doi":"10.1145/3078971.3078996","DOIUrl":"https://doi.org/10.1145/3078971.3078996","url":null,"abstract":"In this paper, we study automatic geo-localization of online event videos. Different from general image localization task through matching, the appearance of an environment during significant events varies greatly from its daily appearance, since there are usually crowds, decorations or even destruction when a major event happens. This introduces a major challenge: matching the event environment to the daily environment, e.g. as recorded by Google Street View. We observe that some regions in the image, as part of the environment, still preserve the daily appearance even though the whole image (environment) looks quite different. Based on this observation, we formulate the problem as joint saliency estimation and matching at the image region level, as opposed to the key point or whole-image level. As image-level labels of daily environment are easily generated with GPS information, we treat region based saliency estimation and matching as a weakly labeled learning problem over the training data. Our solution is to iteratively optimize saliency and the region-matching model. For saliency optimization, we derive a closed form solution, which has an intuitive explanation. For region matching model optimization, we use self-paced learning to learn from the pseudo labels generated by (sub-optimal) saliency values. We conduct extensive experiments on two challenging public datasets: Boston Marathon 2013 and Tokyo Time Machine. Experimental results show that our solution significantly improves over matching on whole images and the automatically learned saliency is a strong predictor of distinctive building areas.","PeriodicalId":403556,"journal":{"name":"Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval","volume":"104 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132877260","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Leveraging Multi-modal Prior Knowledge for Large-scale Concept Learning in Noisy Web Data","authors":"Junwei Liang, Lu Jiang, Deyu Meng, Alexander Hauptmann","doi":"10.1145/3078971.3079003","DOIUrl":"https://doi.org/10.1145/3078971.3079003","url":null,"abstract":"Learning video concept detectors automatically from the big but noisy web data with no additional manual annotations is a novel but challenging area in the multimedia and the machine learning community. A considerable amount of videos on the web is associated with rich but noisy contextual information, such as the title and other multi-modal information, which provides weak annotations or labels about the video content. To tackle the problem of large-scale noisy learning, We propose a novel method called Multi-modal WEbly-Labeled Learning (WELL-MM), which is established on the state-of-the-art machine learning algorithm inspired by the learning process of human. WELL-MM introduces a novel multi-modal approach to incorporate meaningful prior knowledge called curriculum from the noisy web videos. We empirically study the curriculum constructed from the multi-modal features of the Internet videos and images. The comprehensive experimental results on FCVID and YFCC100M demonstrate that WELL-MM outperforms state-of-the-art studies by a statically significant margin on learning concepts from noisy web video data. In addition, the results also verify that WELL-MM is robust to the level of noisiness in the video data. Notably, WELL-MM trained on sufficient noisy web labels is able to achieve a better accuracy to supervised learning methods trained on the clean manually labeled data.","PeriodicalId":403556,"journal":{"name":"Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133626311","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Conditional Fast Style Transfer Network","authors":"Keiji Yanai, Ryosuke Tanno","doi":"10.1145/3078971.3079037","DOIUrl":"https://doi.org/10.1145/3078971.3079037","url":null,"abstract":"In this paper, we propose a conditional fast neural style transfer network. We extend the network proposed as a fast neural style transfer network by Johnson et al. [1] so that the network can learn multiple styles at the same time. To do that, we add a conditional input which selects a style to be transferred out of the trained styles. In addition, we show that the proposed network can mix multiple styles, although the network is trained with each of the training styles independently. The proposed network can also transfer different styles to the different parts of a given image at the same time, which we call \"spatial style transfer\". In the experiments, we confirmed that no quality degradation occurred in the multi-style network compared to the single network, and linear-weighted multi-style fusion enabled us to generate various kinds of new styles which are different from the trained single styles.","PeriodicalId":403556,"journal":{"name":"Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126546007","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}