{"title":"At the Speed of Sound: Efficient Audio Scene Classification","authors":"B. Dong, C. Lumezanu, Yuncong Chen, Dongjin Song, Takehiko Mizoguchi, Haifeng Chen, L. Khan","doi":"10.1145/3372278.3390730","DOIUrl":"https://doi.org/10.1145/3372278.3390730","url":null,"abstract":"Efficient audio scene classification is essential for smart sensing platforms such as robots, medical monitoring, surveillance, or autonomous vehicles. We propose a retrieval-based scene classification architecture that combines recurrent neural networks and attention to compute embeddings for short audio segments. We train our framework using a custom audio loss function that captures both the relevance of audio segments within a scene and that of sound events within a segment. Using experiments on real audio scenes, we show that we can discriminate audio scenes with high accuracy after listening in for less than a second. This preserves 93% of the detection accuracy obtained after hearing the entire scene.","PeriodicalId":158014,"journal":{"name":"Proceedings of the 2020 International Conference on Multimedia Retrieval","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122197209","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
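The retrieval-based classification described above can be sketched as a nearest-neighbor lookup of a segment embedding against stored scene embeddings. This is an illustration only: the paper's RNN-plus-attention encoder is replaced by a hypothetical stand-in projection, and all names and dimensions are our assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def embed(segment: np.ndarray) -> np.ndarray:
    """Stand-in for the RNN+attention embedding network: a fixed random
    projection followed by L2 normalization (purely illustrative)."""
    proj = np.random.default_rng(42).standard_normal((segment.size, 16))
    v = segment @ proj
    return v / np.linalg.norm(v)

def classify(segment: np.ndarray, index: dict[str, np.ndarray]) -> str:
    """Retrieval-based classification: return the label of the stored
    scene embedding most similar (by cosine) to the query segment."""
    q = embed(segment)
    return max(index, key=lambda label: float(index[label] @ q))

# Build a toy index with one reference embedding per scene class.
scenes = {label: rng.standard_normal(128) for label in ["street", "office", "park"]}
index = {label: embed(sig) for label, sig in scenes.items()}

# A slightly perturbed excerpt of the "office" scene should retrieve "office".
noisy = scenes["office"] + 0.05 * rng.standard_normal(128)
print(classify(noisy, index))
```

Classifying after "less than a second" corresponds to querying the index with the embedding of only the first short segment of a scene.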
{"title":"Automation of Deep Learning - Theory and Practice","authors":"Martin Wistuba, Ambrish Rawat, Tejaswini Pedapati","doi":"10.1145/3372278.3390739","DOIUrl":"https://doi.org/10.1145/3372278.3390739","url":null,"abstract":"The growing interest in the automation of machine learning and deep learning has inevitably led to the development of a wide variety of methods for automating deep learning. The choice of network architecture has proven critical, and many improvements in deep learning stem from new ways of structuring networks. However, deep learning techniques are computationally intensive, and their use requires a high level of domain knowledge. Even a partial automation of this process therefore helps to make deep learning more accessible to everyone. In this tutorial, we present a uniform formalism that enables different methods to be categorized, and we compare the approaches in terms of their performance. We achieve this through a comprehensive discussion of the commonly used architecture search spaces and of architecture optimization algorithms based on reinforcement learning and evolutionary algorithms, as well as approaches that include surrogate and one-shot models. In addition, we discuss approaches that accelerate the search for neural architectures through early termination and transfer learning, and we address new research directions, including constrained and multi-objective architecture search as well as the automated search for data augmentation, optimizers, and activation functions.","PeriodicalId":158014,"journal":{"name":"Proceedings of the 2020 International Conference on Multimedia Retrieval","volume":"53 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128134438","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
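The combination of architecture search and early termination that the tutorial covers can be sketched minimally as random search over a toy space, where weak candidates are dropped after a small budget. The search space and the `proxy_score` function are hypothetical stand-ins; a real search would train each candidate and measure validation accuracy.

```python
import random

# Toy search space; keys and values are illustrative assumptions.
SEARCH_SPACE = {
    "depth": [2, 4, 8],
    "width": [32, 64, 128],
    "activation": ["relu", "swish"],
}

def sample_architecture(rng: random.Random) -> dict:
    return {k: rng.choice(v) for k, v in SEARCH_SPACE.items()}

def proxy_score(arch: dict, budget: int) -> float:
    """Hypothetical cheap quality estimate; a real system would train the
    candidate network for `budget` steps and report validation accuracy."""
    base = 0.5 + 0.01 * arch["depth"] + 0.001 * arch["width"]
    return min(base + 0.01 * budget, 1.0)

def random_search(trials: int, seed: int = 0) -> dict:
    rng = random.Random(seed)
    best, best_score = None, float("-inf")
    for _ in range(trials):
        arch = sample_architecture(rng)
        # Early termination: discard candidates that already look weak
        # after a small budget instead of paying for full evaluation.
        if proxy_score(arch, budget=1) < best_score - 0.05:
            continue
        score = proxy_score(arch, budget=10)
        if score > best_score:
            best, best_score = arch, score
    return best

print(random_search(trials=20))
```

Reinforcement-learning and evolutionary searches differ only in how the next candidate is proposed; the evaluate-and-keep-best loop stays the same.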
{"title":"System Fusion with Deep Ensembles","authors":"Liviu-Daniel Stefan, M. Constantin, B. Ionescu","doi":"10.1145/3372278.3390720","DOIUrl":"https://doi.org/10.1145/3372278.3390720","url":null,"abstract":"Deep neural networks (DNNs) are universal estimators that have achieved state-of-the-art performance in a broad spectrum of classification tasks, opening new perspectives for many applications. One such application is ensemble learning. In this paper, we introduce a set of deep learning techniques for ensemble learning with dense, attention, and convolutional neural network layers. Our approach automatically discovers patterns and correlations between the decisions of individual classifiers, thereby alleviating the difficulty of building such architectures. To assess its robustness, we evaluate our approach on two complex data sets that target different perspectives of predicting the user perception of multimedia data, i.e., interestingness and violence. The proposed approach outperforms the existing state-of-the-art algorithms by a large margin.","PeriodicalId":158014,"journal":{"name":"Proceedings of the 2020 International Conference on Multimedia Retrieval","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131727789","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
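Decision-level fusion of the kind this paper studies can be sketched in its simplest dense form: stack the scores of individual classifiers and learn combination weights with a single logistic layer. The data below is synthetic, and the paper's ensembles also use attention and convolutional layers, which this sketch omits.

```python
import numpy as np

rng = np.random.default_rng(1)

n, k = 200, 3                      # samples, base classifiers
y = rng.integers(0, 2, size=n)     # ground-truth binary labels
# Simulated base-classifier scores: two informative columns, one pure noise.
scores = np.stack([
    y + 0.3 * rng.standard_normal(n),
    y + 0.5 * rng.standard_normal(n),
    rng.standard_normal(n),
], axis=1)

# A single dense (logistic) fusion layer trained by plain gradient descent.
w, b = np.zeros(k), 0.0
for _ in range(500):
    p = 1 / (1 + np.exp(-(scores @ w + b)))
    grad = p - y
    w -= 0.1 * (scores.T @ grad) / n
    b -= 0.1 * grad.mean()

fused = (1 / (1 + np.exp(-(scores @ w + b))) > 0.5).astype(int)
print("fusion accuracy:", (fused == y).mean())
print("learned weights:", w.round(2))
```

The learned weights show what "discovering correlations between classifier decisions" means in the simplest case: the uninformative third classifier receives a weight near zero.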
{"title":"Deep Semantic-Alignment Hashing for Unsupervised Cross-Modal Retrieval","authors":"Dejie Yang, Dayan Wu, Wanqian Zhang, Haisu Zhang, Bo Li, Weiping Wang","doi":"10.1145/3372278.3390673","DOIUrl":"https://doi.org/10.1145/3372278.3390673","url":null,"abstract":"Deep hashing methods have achieved tremendous success in cross-modal retrieval, due to their low storage consumption and fast retrieval speed. In real cross-modal retrieval applications, label information is often hard to obtain. Recently, increasing attention has been paid to unsupervised cross-modal hashing. However, existing methods fail to exploit the intrinsic connections between images and their corresponding descriptions or tags (text modality). In this paper, we propose a novel Deep Semantic-Alignment Hashing (DSAH) method for unsupervised cross-modal retrieval, which fully exploits co-occurring image-text pairs. DSAH explores the similarity information of the different modalities through a carefully designed semantic-alignment loss function that aligns the similarities between features with those between hash codes. Moreover, to further bridge the modality gap, we propose to reconstruct the features of one modality from the hash codes of the other. Extensive experiments on three cross-modal retrieval datasets demonstrate that DSAH achieves state-of-the-art performance.","PeriodicalId":158014,"journal":{"name":"Proceedings of the 2020 International Conference on Multimedia Retrieval","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114776444","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
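The semantic-alignment loss described above can be sketched as follows: the similarity matrix computed from continuous features should match the one computed from the (relaxed) hash codes. This is our illustration, not the authors' code; all names, shapes, and the synthetic data are assumptions.

```python
import numpy as np

def cosine_sim(x: np.ndarray) -> np.ndarray:
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    return x @ x.T

def alignment_loss(features: np.ndarray, codes: np.ndarray) -> float:
    """Mean squared error between feature similarities and the similarities
    of tanh-relaxed hash codes (a common continuous relaxation of +/-1 bits)."""
    s_feat = cosine_sim(features)
    s_code = cosine_sim(np.tanh(codes))
    return float(((s_feat - s_code) ** 2).mean())

rng = np.random.default_rng(0)
# Two clusters of features, e.g. four images of one concept and four of another.
u, v = rng.standard_normal(64), rng.standard_normal(64)
feats = np.vstack([u + 0.3 * rng.standard_normal((4, 64)),
                   v + 0.3 * rng.standard_normal((4, 64))])
# Codes derived from the features preserve the similarity structure...
good_codes = feats @ rng.standard_normal((64, 16))
# ...while random codes do not, so they should incur a higher loss.
bad_codes = rng.standard_normal((8, 16))

print(alignment_loss(feats, good_codes), alignment_loss(feats, bad_codes))
```

Minimizing such a loss pushes the hash codes to reproduce the neighborhood structure of the features, which is the alignment the abstract refers to.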
{"title":"Sentence-based and Noise-robust Cross-modal Retrieval on Cooking Recipes and Food Images","authors":"Zichen Zan, Lin Li, Jianquan Liu, D. Zhou","doi":"10.1145/3372278.3390681","DOIUrl":"https://doi.org/10.1145/3372278.3390681","url":null,"abstract":"In recent years, people are faced with billions of food images, videos, and recipes on social media. An appropriate technology, such as a cross-modal retrieval framework, is highly desirable for retrieving accurate content across food images and cooking recipes. Based on our observations, the order of sentences in recipes and the noise in food images affect retrieval results. We take into account the sentence-level sequential order of instructions and ingredients in recipes, and the noisy portions of food images, to propose a new framework for cross-modal retrieval. In our framework, we propose three new strategies to improve retrieval accuracy. (1) We encode recipe titles, ingredients, and instructions at the sentence level, and adopt three separate attention networks on multi-layer hidden state features to capture more semantic information. (2) We apply an attention mechanism that incorporates recipe embeddings to select effective features from food images, and adopt an adversarial learning strategy to enhance modality alignment. (3) We design a new triplet loss scheme with an effective sampling strategy to reduce the impact of noise on retrieval results. The experimental results show that our framework clearly outperforms state-of-the-art methods in terms of median rank and recall rate at top k on the Recipe 1M dataset.","PeriodicalId":158014,"journal":{"name":"Proceedings of the 2020 International Conference on Multimedia Retrieval","volume":"102 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125882036","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
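The triplet loss at the heart of strategy (3) can be sketched with a simple in-batch hard-negative sampling rule. This is our illustration of the standard formulation, not the authors' exact scheme; embeddings and the margin value are toy assumptions.

```python
import numpy as np

def triplet_loss(anchor, positive, negatives, margin=0.2):
    """Penalize the anchor-positive distance unless it beats the distance
    to the hardest (closest) negative by at least `margin`."""
    d_pos = np.linalg.norm(anchor - positive)
    # Hard-negative sampling: the negative closest to the anchor.
    d_neg = min(np.linalg.norm(anchor - n) for n in negatives)
    return max(0.0, d_pos - d_neg + margin)

rng = np.random.default_rng(0)
recipe = rng.standard_normal(32)                             # recipe embedding
matching_image = recipe + 0.1 * rng.standard_normal(32)      # its food image
other_images = [rng.standard_normal(32) for _ in range(5)]   # unrelated images

print(triplet_loss(recipe, matching_image, other_images))    # → 0.0
```

The loss is zero here because the matching image is far closer to the recipe than any negative; a noisy or mismatched positive would yield a positive loss and thus a gradient.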
{"title":"Anomaly Detection in Traffic Surveillance Videos with GAN-based Future Frame Prediction","authors":"Khac-Tuan Nguyen, Dat-Thanh Dinh, M. Do, M. Tran","doi":"10.1145/3372278.3390701","DOIUrl":"https://doi.org/10.1145/3372278.3390701","url":null,"abstract":"It is essential to develop efficient methods to detect abnormal events, such as car crashes or stalled vehicles, from surveillance cameras so that timely help can be provided. This motivates us to propose a novel method to detect traffic accidents in traffic videos. To tackle the problem that anomalies occupy only a small amount of the data, we propose a semi-supervised method using a Generative Adversarial Network (GAN) trained on regular sequences to predict future frames. Our key idea is to model the ordinary world with a generative model, then compare a predicted frame with the real next frame to determine whether an abnormal event occurs. We also propose encoding motion descriptors and a scaled intensity loss function to optimize the GAN for fast-moving objects. Experiments on the Traffic Anomaly Detection dataset of the AI City Challenge 2019 show that our method achieves a top-3 result, with an F1 score of 0.9412, an RMSE of 4.8088, and an S3 score of 0.9261. Our method can be applied to related applications of anomaly and outlier detection in videos.","PeriodicalId":158014,"journal":{"name":"Proceedings of the 2020 International Conference on Multimedia Retrieval","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129505980","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
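The compare-predicted-with-real step of the method above can be sketched with a PSNR-based score: when the GAN has only seen normal traffic, its predictions degrade on abnormal frames. The generator itself is out of scope here; frame sizes and the decibel threshold are illustrative assumptions.

```python
import numpy as np

def psnr(pred: np.ndarray, real: np.ndarray, peak: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB between a predicted and a real frame."""
    mse = float(((pred - real) ** 2).mean())
    return float("inf") if mse == 0 else 10 * np.log10(peak ** 2 / mse)

def is_anomalous(pred: np.ndarray, real: np.ndarray, threshold_db: float = 20.0) -> bool:
    """Flag an anomaly when prediction quality drops below the threshold."""
    return psnr(pred, real) < threshold_db

rng = np.random.default_rng(0)
real_frame = rng.random((64, 64))
good_pred = real_frame + 0.01 * rng.standard_normal((64, 64))  # normal scene: well predicted
bad_pred = rng.random((64, 64))                                # anomaly: prediction misses

print(is_anomalous(good_pred, real_frame), is_anomalous(bad_pred, real_frame))  # → False True
```

In practice the per-frame score is usually normalized over a sliding window before thresholding, so that lighting changes do not trigger false alarms.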
{"title":"One Shot Logo Recognition Based on Siamese Neural Networks","authors":"Camilo Vargas, Qianni Zhang, E. Izquierdo","doi":"10.1145/3372278.3390734","DOIUrl":"https://doi.org/10.1145/3372278.3390734","url":null,"abstract":"This work presents an approach for one-shot logo recognition that relies on a Siamese neural network (SNN) embedded with a pre-trained model that is fine-tuned on a challenging logo dataset. Although the model is fine-tuned using logo images, the training and testing datasets do not have overlapping categories, meaning that all the classes used for testing the one-shot recognition framework remain unseen during the fine-tuning process. The recognition process follows the standard SNN approach, in which a pair of input images is encoded by each sister network. The encoded outputs are then compared using a trained metric and thresholded to define matches and mismatches. The proposed approach achieves an accuracy of 77.07% under the one-shot constraints on the QMUL-OpenLogo dataset. Code is available at https://github.com/cjvargasc/oneshot_siamese/.","PeriodicalId":158014,"journal":{"name":"Proceedings of the 2020 International Conference on Multimedia Retrieval","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131072978","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
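The standard SNN decision rule the abstract describes can be sketched in a few lines: both inputs pass through the same shared-weight encoder, the embeddings are compared with a metric, and the distance is thresholded into match or mismatch. The encoder and threshold below are toy stand-ins, not the fine-tuned network from the paper.

```python
import numpy as np

# Shared weights: both "sister" networks are the same function (hence Siamese).
SHARED_W = np.random.default_rng(7).standard_normal((64, 8))

def encode(x: np.ndarray) -> np.ndarray:
    """Toy sister network: linear projection plus L2 normalization."""
    v = x @ SHARED_W
    return v / np.linalg.norm(v)

def distance(a: np.ndarray, b: np.ndarray) -> float:
    """Metric on the embedding pair; a trained metric in the real system."""
    return float(np.linalg.norm(encode(a) - encode(b)))

def same_logo(a: np.ndarray, b: np.ndarray, threshold: float = 1.0) -> bool:
    return distance(a, b) < threshold

rng = np.random.default_rng(0)
logo = rng.standard_normal(64)
variant = logo + 0.05 * rng.standard_normal(64)  # same logo, slight distortion
other = rng.standard_normal(64)                  # a different logo
print(distance(logo, variant), distance(logo, other))
```

Because the decision depends only on pairwise distances, classes never seen during fine-tuning can still be recognized from a single reference image, which is what makes the setup one-shot.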
{"title":"Detection of Semantic Risk Situations in Lifelog Data for Improving Life of Frail People","authors":"Thinhinane Yebda, J. Benois-Pineau, M. Pech, H. Amièva, C. Gurrin","doi":"10.1145/3372278.3391931","DOIUrl":"https://doi.org/10.1145/3372278.3391931","url":null,"abstract":"The automatic recognition of risk situations for frail people is an urgent research topic for the interdisciplinary artificial intelligence and multimedia community. Risky situations can be recognized from lifelog data recorded with wearable devices. In this paper, we present a new approach for the detection of semantic risk situations for frail people in lifelog data. Concept matching between general lifelog and risk taxonomies was performed, and a fine-tuned AlexNet was deployed to detect two semantic risk situations, risk of domestic accident and risk of fraud, with promising results.","PeriodicalId":158014,"journal":{"name":"Proceedings of the 2020 International Conference on Multimedia Retrieval","volume":"187 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134451991","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
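The concept-matching step can be sketched as a set overlap between concepts detected in a lifelog frame and a small risk taxonomy. The taxonomy contents and the overlap threshold below are invented for illustration; the paper's taxonomies are far richer.

```python
# Hypothetical risk taxonomy mapping risk categories to trigger concepts.
RISK_TAXONOMY = {
    "domestic_accident": {"knife", "stove", "stairs", "wet floor"},
    "fraud": {"stranger", "door", "phone", "bank card"},
}

def detect_risks(lifelog_concepts: set[str], min_overlap: int = 2) -> list[str]:
    """Return the risk categories whose taxonomy shares at least
    `min_overlap` concepts with those detected in the lifelog frame."""
    return [risk for risk, concepts in RISK_TAXONOMY.items()
            if len(concepts & lifelog_concepts) >= min_overlap]

print(detect_risks({"stove", "knife", "kitchen"}))   # → ['domestic_accident']
print(detect_risks({"phone", "sofa"}))               # → []
```

In the paper, the lifelog concepts themselves come from an image classifier (a fine-tuned AlexNet) applied to the wearable-camera frames; the matching step then lifts those visual concepts to semantic risk categories.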
{"title":"MMArt-ACM'20: International Joint Workshop on Multimedia Artworks Analysis and Attractiveness Computing in Multimedia 2020","authors":"W. Chu, I. Ide, Naoko Nitta, N. Tsumura, T. Yamasaki","doi":"10.1145/3372278.3388042","DOIUrl":"https://doi.org/10.1145/3372278.3388042","url":null,"abstract":"The International Joint Workshop on Multimedia Artworks Analysis and Attractiveness Computing in Multimedia (MMArt-ACM) solicits contributions on methodology advancement and novel applications of multimedia artworks analysis and attractiveness computing that emerge in the era of big data and deep learning. Despite the disruption caused by the Covid-19 pandemic, the workshop attracted submissions on diverse topics in these two fields, and the final workshop program consists of five presented papers. The topics cover image retrieval, image transformation and generation, recommendation systems, and image/video summarization. The actual MMArt-ACM'20 Proceedings are available in the ACM DL at: https://dl.acm.org/citation.cfm?id=3379173","PeriodicalId":158014,"journal":{"name":"Proceedings of the 2020 International Conference on Multimedia Retrieval","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125532361","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Lightweight Gated Global Module for Global Context Modeling in Neural Networks","authors":"Li Hao, Liping Hou, Yuantao Song, K. Lu, Jian Xue","doi":"10.1145/3372278.3390712","DOIUrl":"https://doi.org/10.1145/3372278.3390712","url":null,"abstract":"Global context modeling has been used to achieve better performance in various computer-vision-related tasks, such as classification, detection, segmentation, and multimedia retrieval applications. However, most existing global mechanisms suffer from convergence problems during training. In this paper, we propose a novel gated global module (GGM) that is lightweight yet effective at integrating global information into feature representations. Treating the original structure of the network as a local block, our module infers global information in parallel with the local information; a gate function then generates global guidance, which is applied to the output of the local module to capture representative information. The proposed GGM can be easily integrated with common CNN architectures and is easy to train. We used a classification task as an example to verify the effectiveness of the proposed GGM, and extensive experiments on ImageNet and CIFAR demonstrated that our method can be widely applied and is conducive to integrating global information into common networks.","PeriodicalId":158014,"journal":{"name":"Proceedings of the 2020 International Conference on Multimedia Retrieval","volume":"117 16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126403908","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
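The gating idea described above can be sketched as: pool a global descriptor from the feature map, pass it through a gate, and use the result to reweight the local branch's output channel-wise. Shapes and the exact gate form are our assumptions, since the paper's module details are not reproduced here.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def gated_global_module(feature_map: np.ndarray, w_gate: np.ndarray) -> np.ndarray:
    """feature_map: (C, H, W) local-branch output; w_gate: (C, C) gate weights."""
    global_desc = feature_map.mean(axis=(1, 2))   # global average pooling: (C,)
    gate = sigmoid(w_gate @ global_desc)          # per-channel gate in (0, 1)
    # Global guidance applied to the local output, channel-wise.
    return feature_map * gate[:, None, None]

rng = np.random.default_rng(0)
fmap = rng.standard_normal((8, 4, 4))     # toy feature map: 8 channels, 4x4 spatial
w = 0.1 * rng.standard_normal((8, 8))
out = gated_global_module(fmap, w)
print(out.shape)   # → (8, 4, 4)
```

Because the output shape matches the input shape, such a module can be dropped in parallel to any existing block without altering the rest of the architecture, which is what makes the design easy to integrate.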