{"title":"Family Photo Recognition via Multiple Instance Learning","authors":"Junkang Zhang, Siyu Xia, Ming Shao, Y. Fu","doi":"10.1145/3078971.3079036","DOIUrl":"https://doi.org/10.1145/3078971.3079036","url":null,"abstract":"Family photo recognition is an important task in social media analytics. Previous methods use singleton global features and conventional binary classifiers to distinguish family group photos from non-family ones. Different from them, we propose a novel family recognition approach with three dedicated local representations under Multiple Instance Learning framework, where geometry, kinship and semantic features are integrated to overcome issues in the previous work. Experimental results show that our method achieves the state-of-the-art result among global-feature models.","PeriodicalId":403556,"journal":{"name":"Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval","volume":"320 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122707197","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Visually Browsing Millions of Images Using Image Graphs","authors":"K. U. Barthel, N. Hezel, K. Jung","doi":"10.1145/3078971.3079016","DOIUrl":"https://doi.org/10.1145/3078971.3079016","url":null,"abstract":"We present a new approach to visually browse very large sets of untagged images. High quality image features are generated using transformed activations of a convolutional neural network. These features are used to model image similarities, from which a hierarchical image graph is build. We show how such a graph can be constructed efficiently. In our experiments we found best user experience for navigating the graph is achieved by projecting sub-graphs onto a regular 2D image map. This allows users to explore the image collection like an interactive map.","PeriodicalId":403556,"journal":{"name":"Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124828419","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Deep Sentiment Features of Context and Faces for Affective Video Analysis","authors":"C. Baecchi, Tiberio Uricchio, M. Bertini, A. Bimbo","doi":"10.1145/3078971.3079027","DOIUrl":"https://doi.org/10.1145/3078971.3079027","url":null,"abstract":"Given the huge quantity of hours of video available on video sharing platforms such as YouTube, Vimeo, etc. development of automatic tools that help users find videos that fit their interests has attracted the attention of both scientific and industrial communities. So far the majority of the works have addressed semantic analysis, to identify objects, scenes and events depicted in videos, but more recently affective analysis of videos has started to gain more attention. In this work we investigate the use of sentiment driven features to classify the induced sentiment of a video, i.e. the sentiment reaction of the user. Instead of using standard computer vision features such as CNN features or SIFT features trained to recognize objects and scenes, we exploit sentiment related features such as the ones provided by Deep-SentiBank, and features extracted from models that exploit deep networks trained on face expressions. We experiment on two recently introduced datasets: LIRIS-ACCEDE and MEDIAEVAL-2015, that provide sentiment annotations of a large set of short videos. We show that our approach not only outperforms the current state-of-the-art in terms of valence and arousal classification accuracy, but it also uses a smaller number of features, requiring thus less video processing.","PeriodicalId":403556,"journal":{"name":"Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127792744","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Finger Vein Image Retrieval via Coding Scale-varied Superpixel Feature","authors":"Kuikui Wang, Lu Yang, Gongping Yang, Xin Luo, Kun Su, Yilong Yin","doi":"10.1145/3078971.3078975","DOIUrl":"https://doi.org/10.1145/3078971.3078975","url":null,"abstract":"Finger vein image retrieval is one significant technique for performing fast identification especially in large-scale applications. However, most existing retrieval methods were based on fixed-scale feature of non-overlapped rectangular image block, in which the representation ability of feature and the local consistency of vein pattern were both overlooked. And the weak encoding (e.g., predefined threshold based binarization) was also limited the retrieval performance. Focusing on these problems, this paper proposes a novel finger vein image retrieval framework based on similarity-preserving encoding of scale-varied superpixel feature. In the framework, locally consistent pixels in one superpixel are used as a unit of feature representation, and the feature length is varied with the category of the superpixel classified by the variance of lowest dimensional feature. Additionally, the feature compaction and feature rotation based encoding can minimize the quantization loss and preserve the similarity between the scale-varied feature and the encoded binary codes. Experimental results on six public finger vein databases demonstrate that the superiority of the proposed coding scale-varied superpixel feature based retrieval approach over the state-of-the-arts.","PeriodicalId":403556,"journal":{"name":"Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124510042","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Intelligently Connecting People with Information","authors":"Changhu Wang","doi":"10.1145/3078971.3081371","DOIUrl":"https://doi.org/10.1145/3078971.3081371","url":null,"abstract":"How to effectively connect people with information is a fundamental problem in human society. We are now in the era of mobile first, and everything is digitally connected. With the advent of diverse social contents, information feeds have become a new way to connect people with information. Thus, there is a pretty good opportunity for artificial intelligence (AI) to make innovations in this direction. AI can make more efficient and intelligent the creation, moderation, dissemination, searching, consumption, and interaction of information and contents. As an industry leader in the product platform and service of information feeds, Toutiao takes the lead to develop and leverage diverse machine learning techniques to efficiently process, analyze, mine, understand, and organize a large amount of multimedia data. Meanwhile, owning to its rich application scenarios and active users all over the world, we have accumulated huge amount of training data, which makes the machine learning system form a closed feedback loop and thus can continually improve and evolve itself. This closed-loop system enables Toutiao to develop core AI technologies in large-scale machine learning, text analysis, natural language processing, computer vision, and data mining. In this talk, I will share some personal opinions to the development prospects of AI in this fundamental area, including my understanding to AI, important research progress in recent years, the influence of AI to the software industry, and how to build the core competence strategy of AI in a company. Moreover, I will also introduce some research progress of Toutiao AI Lab.","PeriodicalId":403556,"journal":{"name":"Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116702096","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Unsupervised Distance Learning Framework for Multimedia Retrieval","authors":"Lucas Pascotti Valem, D. C. G. Pedronette","doi":"10.1145/3078971.3079017","DOIUrl":"https://doi.org/10.1145/3078971.3079017","url":null,"abstract":"Due to the increasing availability of image and multimedia collections, unsupervised post-processing methods, which are capable of improving the effectiveness of retrieval results without the need of user intervention, have become indispensable. This paper presents the Unsupervised Distance Learning Framework (UDLF), a software which enables an easy use and evaluation of unsupervised learning methods. The framework defines a broad model, allowing the implementation of different unsupervised methods and supporting diverse file formats for input and output. Seven different unsupervised methods are initially available in the framework. Executions and experiments can be easily defined by setting a configuration file. The framework also includes the evaluation of the retrieval results exporting visual output results, computing effectiveness and efficiency measures. The source-code is public available, such that anyone can freely access, use, change, and share the software under the terms of the GPLv2 license.","PeriodicalId":403556,"journal":{"name":"Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132559220","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Information Retrieval from Multi-Sensor Data for Enriching Location Services at HERE Technologies","authors":"Matei Stroila","doi":"10.1145/3078971.3081370","DOIUrl":"https://doi.org/10.1145/3078971.3081370","url":null,"abstract":"HERE Technologies provides real-time location services that enable people, enterprises, and cities around the world to harness the power of location and create innovative solutions for a safer and more efficient living. Multimedia retrieval techniques and sensor fusion approaches are essential for enriching location services and for keeping the underlying map up to date. In this talk, I will give an overview of some of the work we do in the CTO Research group to support existing location services and enable future ones. We aim to automatically extract useful information from massive collections of images, LiDAR point clouds, car sensor data and open web data. I will present work related to image recognition for map making purposes, information retrieval for points of interest enrichment, and work related to creating a highly accurate map of the roads and cities for the future autonomous navigation services.","PeriodicalId":403556,"journal":{"name":"Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133245304","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Tutorials","authors":"G. Awad","doi":"10.1145/3254614","DOIUrl":"https://doi.org/10.1145/3254614","url":null,"abstract":"","PeriodicalId":403556,"journal":{"name":"Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval","volume":"131 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115663355","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving Image Classification using Coarse and Fine Labels","authors":"Anuvabh Dutt, D. Pellerin, G. Quénot","doi":"10.1145/3078971.3079042","DOIUrl":"https://doi.org/10.1145/3078971.3079042","url":null,"abstract":"The performance of classifiers is in general improved by designing models with a large number of parameters or by ensembles. We tackle the problem of classification of coarse and fine grained categories, which share a semantic relationship. On being given the predictions that a classifier has for a given test sample, we adjust the probabilities according to the semantics of the categories, on which the classifier was trained. We present an algorithm for doing such an adjustment and we demonstrate improvement for both coarse and fine grained classification. We evaluate our method using convolutional neural networks. However, the algorithm can be applied to any classifier which outputs category wise probabilities.","PeriodicalId":403556,"journal":{"name":"Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116589690","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Musical Instrument Recognition in User-generated Videos using a Multimodal Convolutional Neural Network Architecture","authors":"Olga Slizovskaia, E. Gómez, G. Haro","doi":"10.1145/3078971.3079002","DOIUrl":"https://doi.org/10.1145/3078971.3079002","url":null,"abstract":"This paper presents a method for recognizing musical instruments in user-generated videos. Musical instrument recognition from music signals is a well-known task in the music information retrieval (MIR) field, where current approaches rely on the analysis of the good-quality audio material. This work addresses a real-world scenario with several research challenges, i.e. the analysis of user-generated videos that are varied in terms of recording conditions and quality and may contain multiple instruments sounding simultaneously and background noise. Our approach does not only focus on the analysis of audio information, but we exploit the multimodal information embedded in the audio and visual domains. In order to do so, we develop a Convolutional Neural Network (CNN) architecture which combines learned representations from both modalities at a late fusion stage. Our approach is trained and evaluated on two large-scale video datasets: YouTube-8M and FCVID. The proposed architectures demonstrate state-of-the-art results in audio and video object recognition, provide additional robustness to missing modalities, and remains computationally cheap to train.","PeriodicalId":403556,"journal":{"name":"Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval","volume":"81 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126228058","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}