{"title":"Deep Multi-task Learning with Label Correlation Constraint for Video Concept Detection","authors":"Fotini Markatopoulou, V. Mezaris, I. Patras","doi":"10.1145/2964284.2967271","DOIUrl":"https://doi.org/10.1145/2964284.2967271","url":null,"abstract":"In this work we propose a method that integrates multi-task learning (MTL) and deep learning. Our method appends a MTL-like loss to a deep convolutional neural network, in order to learn the relations between tasks together at the same time, and also incorporates the label correlations between pairs of tasks. We apply the proposed method on a transfer learning scenario, where our objective is to fine-tune the parameters of a network that has been originally trained on a large-scale image dataset for concept detection, so that it be applied on a target video dataset and a corresponding new set of target concepts. We evaluate the proposed method for the video concept detection problem on the TRECVID 2013 Semantic Indexing dataset. Our results show that the proposed algorithm leads to better concept-based video annotation than existing state-of-the-art methods.","PeriodicalId":140670,"journal":{"name":"Proceedings of the 24th ACM international conference on Multimedia","volume":"66 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123188010","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Digital World to Thrive In: How the Internet of Things Can Make the \"Invisible Hand\" Work","authors":"D. Helbing","doi":"10.1145/2964284.2984749","DOIUrl":"https://doi.org/10.1145/2964284.2984749","url":null,"abstract":"Managing data-rich societies wisely and reaching sustainable development are among the greatest challenges of the 21st century. We are faced with existential threats and huge opportunities. If we don't act now, large parts of our society will not be able to economically benefit from the digital revolution. This could lead to mass unemployment and social unrest. It is time to create the right framework for the digital society to come.","PeriodicalId":140670,"journal":{"name":"Proceedings of the 24th ACM international conference on Multimedia","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123853433","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Morph: A Fast and Scalable Cloud Transcoding System","authors":"Guanyu Gao, Yonggang Wen","doi":"10.1145/2964284.2973792","DOIUrl":"https://doi.org/10.1145/2964284.2973792","url":null,"abstract":"Morph is an open source cloud transcoding system. It can leverage the scalability of the cloud infrastructure to encode and transcode video contents in fast speed, and dynamically provision the resources in cloud to accommodate the workload. The system is composed of a master node that performs the video file segmentation, concentration, and task scheduling operations; and multiple worker nodes that perform the transcoding for video blocks. Morph can transcode the video blocks of a video file on multiple workers in parallel to achieve fast speed, and automatically manage the data transfers and communications between the master node and the worker nodes. The worker nodes can join into or leave the transcoding cluster at any time for dynamic resource provisioning. The system is very modular, and all of the algorithms can be easily modified or replaced. We release the source code of Morph under MIT License, hoping that it can be shared among various research communities.","PeriodicalId":140670,"journal":{"name":"Proceedings of the 24th ACM international conference on Multimedia","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125189249","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Adaptive Visual Feedback Generation for Facial Expression Improvement with Multi-task Deep Neural Networks","authors":"Takuhiro Kaneko, Kaoru Hiramatsu, K. Kashino","doi":"10.1145/2964284.2967236","DOIUrl":"https://doi.org/10.1145/2964284.2967236","url":null,"abstract":"While many studies in computer vision and pattern recognition have been actively conducted to recognize people's current states, few studies have tackled the problem of generating feedback on how people can improve their states, although there are many real-world applications such as in sports, education, and health care. In particular, it has been challenging to develop such a system that can adaptively generate feedback for real-world situations, namely various input and target states, since it requires formulating various rules of feedback to do so. We propose a learning-based method to solve this problem. If we can obtain a large amount of feedback annotations, it is possible to explicitly learn the rules, but it is difficult to do so due to the subjective nature of the task. To mitigate this problem, our method implicitly learns the rules from training data consisting of input images, key-point annotations, and state annotations that do not require professional knowledge in feedback. Given such training data, we first learn a multi-task deep neural network with state recognition and key-point localization. Then, we apply a novel propagation method for extracting feedback information from the network. We evaluated our method in a facial expression improvement task using real-world data and clarified its characteristics and effectiveness.","PeriodicalId":140670,"journal":{"name":"Proceedings of the 24th ACM international conference on Multimedia","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125395556","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Deep Representation for Abnormal Event Detection in Crowded Scenes","authors":"Y. Feng, Yuan Yuan, Xiaoqiang Lu","doi":"10.1145/2964284.2967290","DOIUrl":"https://doi.org/10.1145/2964284.2967290","url":null,"abstract":"Abnormal event detection is extremely important, especially for video surveillance. Nowadays, many detectors have been proposed based on hand-crafted features. However, it remains challenging to effectively distinguish abnormal events from normal ones. This paper proposes a deep representation based algorithm which extracts features in an unsupervised fashion. Specially, appearance, texture, and short-term motion features are automatically learned and fused with stacked denoising autoencoders. Subsequently, long-term temporal clues are modeled with a long short-term memory (LSTM) recurrent network, in order to discover meaningful regularities of video events. The abnormal events are identified as samples which disobey these regularities. Moreover, this paper proposes a spatial anomaly detection strategy via manifold ranking, aiming at excluding false alarms. Experiments and comparisons on real world datasets show that the proposed algorithm outperforms state of the arts for the abnormal event detection problem in crowded scenes.","PeriodicalId":140670,"journal":{"name":"Proceedings of the 24th ACM international conference on Multimedia","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129510693","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Learning Music Emotion Primitives via Supervised Dynamic Clustering","authors":"Yang Liu, Yan Liu, Xiang Zhang, Gong Chen, Ke-jun Zhang","doi":"10.1145/2964284.2967215","DOIUrl":"https://doi.org/10.1145/2964284.2967215","url":null,"abstract":"This paper explores a fundamental problem in music emotion analysis, i.e., how to segment the music sequence into a set of basic emotive units, which are named as emotion primitives. Current works on music emotion analysis are mainly based on the fixed-length music segments, which often leads to the difficulty of accurate emotion recognition. Short music segment, such as an individual music frame, may fail to evoke emotion response. Long music segment, such as an entire song, may convey various emotions over time. Moreover, the minimum length of music segment varies depending on the types of the emotions. To address these problems, we propose a novel method dubbed supervised dynamic clustering (SDC) to automatically decompose the music sequence into meaningful segments with various lengths. First, the music sequence is represented by a set of music frames. Then, the music frames are clustered according to the valence-arousal values in the emotion space. The clustering results are used to initialize the music segmentation. After that, a dynamic programming scheme is employed to jointly optimize the subsequent segmentation and grouping in the music feature space. Experimental results on standard dataset show both the effectiveness and the rationality of the proposed method.","PeriodicalId":140670,"journal":{"name":"Proceedings of the 24th ACM international conference on Multimedia","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128293471","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Multimodal Gamified Platform for Real-Time User Feedback in Sports Performance","authors":"David S. Monaghan, Freddie Honohan, A. Ahmadi, T. McDaniel, Ramin Tadayon, Ajay Karpur, Kieran Moran, N. O’Connor, S. Panchanathan","doi":"10.1145/2964284.2973815","DOIUrl":"https://doi.org/10.1145/2964284.2973815","url":null,"abstract":"In this paper we introduce a novel platform that utilises multi-modal low-cost motion capture technology for the delivery of real-time visual feedback for sports performance. This platform supports the expansion to multi-modal interfaces that utilise haptic and audio feedback, which scales effectively with motor task complexity. We demonstrate an implementation of our platform within the field of sports performance. The platform includes low-cost motion capture through a fusion technique, combining a Microsoft Kinect V2 with two wrist inertial sensors, which make use of the accelerometer and gyroscope sensors, alongside a game-based Graphical User Interface (GUI) for instruction, visual feedback and gamified score tracking.","PeriodicalId":140670,"journal":{"name":"Proceedings of the 24th ACM international conference on Multimedia","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128534393","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Domain Robust Approach For Image Dataset Construction","authors":"Yazhou Yao, Xiansheng Hua, Fumin Shen, Jian Zhang, Zhenmin Tang","doi":"10.1145/2964284.2967213","DOIUrl":"https://doi.org/10.1145/2964284.2967213","url":null,"abstract":"There have been increasing research interests in automatically constructing image dataset by collecting images from the Internet. However, existing methods tend to have a weak domain adaptation ability, known as the \"dataset bias problem\". To address this issue, in this work, we propose a novel image dataset construction framework which can generalize well to unseen target domains. In specific, the given queries are first expanded by searching in the Google Books Ngrams Corpora (GBNC) to obtain a richer semantic description, from which the noisy query expansions are then filtered out. By treating each expansion as a \"bag\" and the retrieved images therein as \"instances\", we formulate image filtering as a multi-instance learning (MIL) problem with constrained positive bags. By this approach, images from different data distributions will be kept while with noisy images filtered out. Comprehensive experiments on two challenging tasks demonstrate the effectiveness of our proposed approach.","PeriodicalId":140670,"journal":{"name":"Proceedings of the 24th ACM international conference on Multimedia","volume":"223 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130493415","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cross-batch Reference Learning for Deep Classification and Retrieval","authors":"Huei-Fang Yang, Kevin Lin, Chu-Song Chen","doi":"10.1145/2964284.2964324","DOIUrl":"https://doi.org/10.1145/2964284.2964324","url":null,"abstract":"Learning feature representations for image retrieval is essential to multimedia search and mining applications. Recently, deep convolutional networks (CNNs) have gained much attention due to their impressive performance on object detection and image classification, and the feature representations learned from a large-scale generic dataset (e.g., ImageNet) can be transferred to or fine-tuned on the datasets of other domains. However, when the feature representations learned with a deep CNN are applied to image retrieval, the performance is still not as good as they are used for classification, which restricts their applicability to relevant image search. To ensure the retrieval capability of the learned feature space, we introduce a new idea called cross-batch reference (CBR) to enhance the stochastic-gradient-descent (SGD) training of CNNs. In each iteration of our training process, the network adjustment relies not only on the training samples in a single batch, but also on the information passed by the samples in the other batches. This inter-batches communication mechanism is formulated as a cross-batch retrieval process based on the mean average precision (MAP) criterion, where the relevant and irrelevant samples are encouraged to be placed on top and rear of the retrieval list, respectively. The learned feature space is not only discriminative to different classes, but the samples that are relevant to each other or of the same class are also enforced to be centralized. To maximize the cross-batch MAP, we design a loss function that is an approximated lower bound of the MAP on the feature layer of the network, which is differentiable and easier for optimization. By combining the intra-batch classification and inter-batch cross-reference losses, the learned features are effective for both classification and retrieval tasks. Experimental results on various benchmarks demonstrate the effectiveness of our approach.","PeriodicalId":140670,"journal":{"name":"Proceedings of the 24th ACM international conference on Multimedia","volume":"126 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128671067","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Summary for AVEC 2016: Depression, Mood, and Emotion Recognition Workshop and Challenge","authors":"M. Valstar, J. Gratch, Björn Schuller, F. Ringeval, R. Cowie, M. Pantic","doi":"10.1145/2964284.2980532","DOIUrl":"https://doi.org/10.1145/2964284.2980532","url":null,"abstract":"The sixth Audio-Visual Emotion Challenge and workshop AVEC 2016 was held in conjunction ACM Multimedia'16. This year the AVEC series addresses two distinct sub-challenges, multi-modal emotion recognition and audio-visual depression detection. Both sub-challenges are in a way a return to AVEC's past editions: the emotion sub-challenge is based on the same dataset as the one used in AVEC 2015, and depression analysis was previously addressed in AVEC 2013/2014. In this summary, we mainly describe participation and its conditions.","PeriodicalId":140670,"journal":{"name":"Proceedings of the 24th ACM international conference on Multimedia","volume":"102 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126950581","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}