{"title":"Modular Parallelization Framework for Multi-Stream Video Processing","authors":"Tim Lenertz, G. Lafruit","doi":"10.1145/2964284.2973799","DOIUrl":"https://doi.org/10.1145/2964284.2973799","url":null,"abstract":"A flow-based software framework specialized for 3D and video is presented, which in particular handles automatic parallelization of multi-stream video processing. The workflow is decomposed into a set of filter units which become nodes in a graph. Execution with support for time windows on node inputs is automatically handled by the framework, as well as low-level manipulation of multi-dimensional data.","PeriodicalId":140670,"journal":{"name":"Proceedings of the 24th ACM international conference on Multimedia","volume":"140 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117073759","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"vitrivr: A Flexible Retrieval Stack Supporting Multiple Query Modes for Searching in Multimedia Collections","authors":"Luca Rossetto, Ivan Giangreco, Claudiu Tanase, H. Schuldt","doi":"10.1145/2964284.2973797","DOIUrl":"https://doi.org/10.1145/2964284.2973797","url":null,"abstract":"vitrivr is an open source full-stack content-based multimedia retrieval system with focus on video. Unlike the majority of the existing multimedia search solutions, vitrivr is not limited to searching in metadata, but also provides content-based search and thus offers a large variety of different query modes which can be seamlessly combined: Query by sketch, which allows the user to draw a sketch of a query image and/or sketch motion paths, Query by example, keyword search, and relevance feedback. The vitrivr architecture is self-contained and addresses all aspects of multimedia search, from offline feature extraction, database management to frontend user interaction. The system is composed of three modules: a web-based frontend which allows the user to input the query (e.g., add a sketch) and browse the retrieved results (vitrivr-ui), a database system designed for interactive search in large-scale multimedia collections (ADAM), and a retrieval engine that handles feature extraction and feature-based retrieval (Cineast). The vitrivr source is available on GitHub under the MIT open source (and similar) licenses and is currently undergoing several upgrades as part of the Google Summer of Code 2016.","PeriodicalId":140670,"journal":{"name":"Proceedings of the 24th ACM international conference on Multimedia","volume":"71 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115265089","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Weighted Linear Fusion of Multimodal Data: A Reasonable Baseline?","authors":"Ognjen Arandjelovic","doi":"10.1145/2964284.2964304","DOIUrl":"https://doi.org/10.1145/2964284.2964304","url":null,"abstract":"The ever-increasing demand for reliable inference capable of handling unpredictable challenges of practical application in the real world, has made research on information fusion of major importance. There are few fields of application and research where this is more evident than in the sphere of multimedia which by its very nature inherently involves the use of multiple modalities, be it for learning, prediction, or human-computer interaction, say. In the development of the most common type, score-level fusion algorithms, it is virtually without an exception desirable to have as a reference starting point a simple and universally sound baseline benchmark which newly developed approaches can be compared to. One of the most pervasively used methods is that of weighted linear fusion. It has cemented itself as the default off-the-shelf baseline owing to its simplicity of implementation, interpretability, and surprisingly competitive performance across a wide range of application domains and information source types. In this paper I argue that despite this track record, weighted linear fusion is not a good baseline on the grounds that there is an equally simple and interpretable alternative - namely quadratic mean-based fusion - which is theoretically more principled and which is more successful in practice. I argue the former from first principles and demonstrate the latter using a series of experiments on a diverse set of fusion problems: computer vision-based object recognition, arrhythmia detection, and fatality prediction in motor vehicle accidents.","PeriodicalId":140670,"journal":{"name":"Proceedings of the 24th ACM international conference on Multimedia","volume":"347 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115411172","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient Digital Holographic Image Reconstruction on Mobile Devices","authors":"Chung-Hua Chu","doi":"10.1145/2964284.2967192","DOIUrl":"https://doi.org/10.1145/2964284.2967192","url":null,"abstract":"Advanced digital holography attracts a lot of attentions for 3D visualization nowadays. The representation of digital holographic images suffers from computational inefficiency on the mobile devices due to the limited hardware for digital holographic processing. In this paper, we point out that the above critical issue deteriorates the digital holographic image representation on the mobile devices. To reduce computational complexity in digital holographic image reconstruction, we propose an efficient and effective algorithm to simplify Fresnel transforms for the mobile devices. Our algorithm outperforms previous approaches in not only smaller running time but also the better quality of the digital holographic image representation for the mobile devices.","PeriodicalId":140670,"journal":{"name":"Proceedings of the 24th ACM international conference on Multimedia","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115451444","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SenseCap: Synchronized Data Collection with Microsoft Kinect2 and LeapMotion","authors":"Julian F. P. Kooij","doi":"10.1145/2964284.2973805","DOIUrl":"https://doi.org/10.1145/2964284.2973805","url":null,"abstract":"We present a new recording tool to capture synchronized video and skeletal data streams from cheap sensors such as the Microsoft Kinect2, and LeapMotion. While other recording tools act as virtual playback devices for testing on-line real-time applications, we target multi-media data collection for off-line processing. Images are encoded in common video formats, and skeletal data as flat text tables. This approach enables long duration recordings (e.g. over 30 minutes), and supports post-hoc mapping of the Kinect2 depth video to the color space if needed. By using common file formats, the data can be played back and analyzed on any other computer, without requiring sensor specific SDKs to be installed. The project is released under a 3-clause BSD license, and consists of an extensible C++11 framework, with support for the official Microsoft Kinect 2 and LeapMotion APIs to record, a command-line interface, and a Matlab GUI to initiate, inspect, and load Kinect2 recordings.","PeriodicalId":140670,"journal":{"name":"Proceedings of the 24th ACM international conference on Multimedia","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116137609","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Tamp: A Library for Compact Deep Neural Networks with Structured Matrices","authors":"Bingchen Gong, Brendan Jou, Felix X. Yu, Shih-Fu Chang","doi":"10.1145/2964284.2973802","DOIUrl":"https://doi.org/10.1145/2964284.2973802","url":null,"abstract":"We introduce Tamp, an open source C++ library for reducing the space and time costs of deep neural network models. In particular, Tamp implements several recent works which use structured matrices to replace unstructured matrices which are often bottlenecks in neural networks. Tamp is also designed to serve as a unified development platform with several supported optimization back-ends and abstracted data types. This paper introduces the design and API and also demonstrates the effectiveness with experiments on public datasets.","PeriodicalId":140670,"journal":{"name":"Proceedings of the 24th ACM international conference on Multimedia","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127448642","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Image Emotion Computing","authors":"Sicheng Zhao","doi":"10.1145/2964284.2971473","DOIUrl":"https://doi.org/10.1145/2964284.2971473","url":null,"abstract":"Images can convey rich semantics and induce strong emotions in viewers. My research aims to predict image emotions from different aspects with respect to two main challenges: affective gap and subjective evaluation. To bridge the affective gap, we extract emotion features based on principles-of-art to recognize image-centric dominant emotions. As the emotions that are induced in viewers by an image are highly subjective and different, we propose to predict user-centric personalized emotion perceptions for each viewer and image-centric emotion probability distribution for each image. To tackle the subjective evaluation issue, we set up a large scale image emotion dataset from Flickr, named Image-Emotion-Social-Net, on both dimensional and categorical emotion representations with over 1 million images and about 8,000 users. Different types of factors may influence personalized image emotion perceptions, including visual content, social context, temporal evolution and location influence. We make an initial attempt to jointly combine them by the proposed rolling multi-task hypergraph learning. Both discrete and continuous emotion distributions are modelled via shared sparse learning. Further, several potential applications based on image emotions are designed and implemented.","PeriodicalId":140670,"journal":{"name":"Proceedings of the 24th ACM international conference on Multimedia","volume":"77 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125730779","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Frustratingly Easy Cross-Modal Hashing","authors":"Dekui Ma, Jian Liang, Xiangwei Kong, R. He","doi":"10.1145/2964284.2967218","DOIUrl":"https://doi.org/10.1145/2964284.2967218","url":null,"abstract":"Cross-modal hashing has attracted considerable attention due to its low storage cost and fast retrieval speed. Recently, more and more sophisticated researches related to this topic are proposed. However, they seem to be inefficient computationally for several reasons. On one hand, learning coupled hash projections makes the iterative optimization problem challenging. On the other hand, individual collective binary codes for each content are also learned with a high computation complexity. In this paper we describe a simple yet effective cross-modal hashing approach that can be implemented in just three lines of code. This approach first obtains the binary codes for one modality via unimodal hashing methods (e.g., iterative quantization (ITQ)), then applies simple linear regression to project the other modalities into the obtained binary subspace. Obviously, it is non-iterative and parameter-free, which makes it more attractive for many real-world applications. We further compare our approach with other state-of-the-art methods on four benchmark datasets (i.e., the Wiki, VOC, LabelMe and NUS-WIDE datasets). Despite its extraordinary simplicity, our approach performs remarkably and generally well for these datasets under different experimental settings (i.e., large-scale, high-dimensional and multi-label datasets).","PeriodicalId":140670,"journal":{"name":"Proceedings of the 24th ACM international conference on Multimedia","volume":"112 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128071934","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automatic Music Video Generation Based on Emotion-Oriented Pseudo Song Prediction and Matching","authors":"Jen-Chun Lin, Wen-Li Wei, H. Wang","doi":"10.1145/2964284.2967245","DOIUrl":"https://doi.org/10.1145/2964284.2967245","url":null,"abstract":"The main difficulty in automatic music video (MV) generation lies in how to match two different media (i.e., video and music). This paper proposes a novel content-based MV generation system based on emotion-oriented pseudo song prediction and matching. We use a multi-task deep neural network (MDNN) to jointly learn the relationship among music, video, and emotion from an emotion-annotated MV corpus. Given a queried video, the MDNN is applied to predict the acoustic (music) features from the visual (video) features, i.e., the pseudo song corresponding to the video. Then, the pseudo acoustic (music) features are matched with the acoustic (music) features of each music track in the music collection according to a pseudo-song-based deep similarity matching (PDSM) metric given by another deep neural network (DNN) trained on the acoustic and pseudo acoustic features of the positive (official), less-positive (artificial), and negative (artificial) MV examples. The results of objective and subjective experiments demonstrate that the proposed pseudo-song-based framework performs well and can generate appealing MVs with better viewing and listening experiences.","PeriodicalId":140670,"journal":{"name":"Proceedings of the 24th ACM international conference on Multimedia","volume":"98 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124878298","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Sentiment and Emotion Analysis for Social Multimedia: Methodologies and Applications","authors":"Quanzeng You","doi":"10.1145/2964284.2971475","DOIUrl":"https://doi.org/10.1145/2964284.2971475","url":null,"abstract":"Online social networks have attracted the attention from both the academia and real world. In particular, the rich multimedia information accumulated in recent years provides an easy and convenient way for more active communication between people. This offers an opportunity to research people's behaviors and activities based on those multimedia content. One emerging area is driven by the fact that these massive multimedia data contain people's daily sentiments and opinions. However, existing sentiment analysis typically focuses on textual information regardless of the visual content, which may be as informative in expressing people's sentiments and opinions. In this research, we attempt to analyze the online sentiment changes of social media users using both the textual and visual content. Nowadays, social media networks such as Twitter have become major platforms of information exchange and communication between users, with tweets as the common information carrier. As an old saying has it, an image is worth a thousand words. The image tweet is a great example of multimodal sentiment. In this research, we focus on sentiment analysis based on visual and multimedia information analysis. We will review the state-of-the-art in this topic. Several of our projects related to this research area will also be discussed. Experimental results are included to demonstrate and summarize our contributions.","PeriodicalId":140670,"journal":{"name":"Proceedings of the 24th ACM international conference on Multimedia","volume":"69 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124317652","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}