{"title":"Recognizing actions using salient features","authors":"Liang Wang, Debin Zhao","doi":"10.1109/MMSP.2011.6093832","DOIUrl":"https://doi.org/10.1109/MMSP.2011.6093832","url":null,"abstract":"Toward a compact video feature representation, we propose a novel feature selection methodology for action recognition based on the saliency maps of videos. Since saliency maps measure the perceptual importance of pixels and regions in videos, selecting features using saliency maps enables us to find a feature representation that covers the informative parts of a video. Because saliency detection is a bottom-up procedure, some appearance changes or motions that are irrelevant to actions may also be detected as salient regions. To further improve the purity of the feature representation, we prune these irrelevant salient regions using the distribution of saliency values and the spatial-temporal distribution of the salient regions. Extensive experiments demonstrate that the proposed feature selection method largely improves the performance of the bag-of-video-words model on action recognition, based on three different attention models: a static attention model, a motion attention model, and their combination.","PeriodicalId":214459,"journal":{"name":"2011 IEEE 13th International Workshop on Multimedia Signal Processing","volume":"65 7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114529890","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A system for dynamic playlist generation driven by multimodal control signals and descriptors","authors":"Luca Chiarandini, M. Zanoni, A. Sarti","doi":"10.1109/MMSP.2011.6093850","DOIUrl":"https://doi.org/10.1109/MMSP.2011.6093850","url":null,"abstract":"This work describes a general approach to multimedia playlist generation and description, and an application of the approach to music information retrieval. The example system we implemented updates a musical playlist on the fly based on prior information (musical preferences), current descriptors of the song being played, and fine-grained, semantically rich descriptors (of the user's gestures, environmental conditions, etc.). The system incorporates a learning component that infers the user's preferences. Subjective tests have been conducted on the usability and quality of the recommendation system.","PeriodicalId":214459,"journal":{"name":"2011 IEEE 13th International Workshop on Multimedia Signal Processing","volume":"94 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114849512","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Age estimation based on extended non-negative matrix factorization","authors":"Ce Zhan, W. Li, P. Ogunbona","doi":"10.1109/MMSP.2011.6093779","DOIUrl":"https://doi.org/10.1109/MMSP.2011.6093779","url":null,"abstract":"Previous studies suggested that local appearance-based methods are more efficient than geometric-based and holistic methods for age estimation. This is mainly because age information is usually encoded by local features such as wrinkles and skin texture on the forehead or at the eye corners. However, the variations of these features caused by other factors such as identity, expression, pose and lighting may be larger than those caused by aging. Thus, one of the key challenges of age estimation lies in constructing a feature space that successfully recovers age information while ignoring other sources of variation. In this paper, non-negative matrix factorization (NMF) is extended to learn a localized, non-overlapping subspace representation for age estimation. To emphasize the appearance variation in aging, an individual extended NMF subspace is learned for each age or age group. The age or age group of a given face image is then estimated based on its reconstruction error after projection into the learned age subspaces. Furthermore, a coarse-to-fine scheme is employed for exact age estimation, so that the age is estimated within the pre-classified age group. Cross-database tests are conducted using the FG-NET and MORPH databases to evaluate the proposed method. Experimental results demonstrate the efficacy of the method.","PeriodicalId":214459,"journal":{"name":"2011 IEEE 13th International Workshop on Multimedia Signal Processing","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121638809","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Compression of compound images by combining several strategies","authors":"Cuiling Lan, Jizheng Xu, Feng Wu","doi":"10.1109/MMSP.2011.6093824","DOIUrl":"https://doi.org/10.1109/MMSP.2011.6093824","url":null,"abstract":"Compound images are combinations of text, graphics and natural images. They possess characteristics different from those of natural images, such as strong anisotropy, sparse color histograms and repeated patterns. Prior research on compressing them has mainly focused on developing strategies based on some of these characteristics, but has so far failed to exploit all of them simultaneously. In this paper, we investigate the combination of four up-to-date strategies to construct a comprehensive scheme for compound image compression. We have implemented these strategies as four types of modes with variable block sizes. Experimental results show that the proposed scheme achieves significant coding gains for compound image compression at all bitrates.","PeriodicalId":214459,"journal":{"name":"2011 IEEE 13th International Workshop on Multimedia Signal Processing","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125156449","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A flexible markerless registration method for video augmented reality","authors":"L. Ling, I. Burnett, E. Cheng","doi":"10.1109/MMSP.2011.6093790","DOIUrl":"https://doi.org/10.1109/MMSP.2011.6093790","url":null,"abstract":"This paper proposes a flexible, markerless registration method that addresses the problem of realistic virtual object placement at any position in a video sequence. The registration consists of two steps: first, four points are specified by the user to build the world coordinate system in which the virtual object is rendered. A self-calibration camera tracking algorithm is then proposed to recover the camera viewpoint frame by frame, so that the virtual object can be dynamically and correctly rendered according to camera movement. The proposed registration method requires no reference fiducials and no knowledge of the camera parameters or the user environment; the virtual object can be placed in any environment, even one without distinct features. Experimental evaluations demonstrate low errors of the self-calibration algorithm for several camera rotations around the X and Y axes. Finally, virtual object rendering applications in different user environments are evaluated.","PeriodicalId":214459,"journal":{"name":"2011 IEEE 13th International Workshop on Multimedia Signal Processing","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134114894","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-dimensional correlation steganalysis","authors":"F. Farhat, A. Diyanat, S. Ghaemmaghami, M. Aref","doi":"10.1109/MMSP.2011.6093791","DOIUrl":"https://doi.org/10.1109/MMSP.2011.6093791","url":null,"abstract":"Multi-dimensional spatial analysis of image pixels has not been much investigated for the steganalysis of LSB steganographic methods. Steganalysis methods based on pixel distributions can be thwarted by intelligently compensating the statistical characteristics of image pixels, as reported in several papers. Simple LSB replacement has been improved by smarter LSB embedding approaches, e.g. the LSB matching and LSB+ methods, but these are basically the same in the sense of LSB alteration. A new analytical method to detect LSB stego images is proposed in this paper. Our approach is based on the relative locations of the image pixels that are essentially changed by an LSB embedding system. Furthermore, we introduce some new statistical features, including “local entropies sum” and “clouds min sum”, to achieve higher performance. Simulation results show that our proposed approach outperforms some well-known LSB steganalysis methods in terms of detection accuracy and embedding rate estimation.","PeriodicalId":214459,"journal":{"name":"2011 IEEE 13th International Workshop on Multimedia Signal Processing","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134126047","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ObjectBook construction for large-scale semantic-aware image retrieval","authors":"Shiliang Zhang, Q. Tian, Qingming Huang, Wen Gao","doi":"10.1109/MMSP.2011.6093776","DOIUrl":"https://doi.org/10.1109/MMSP.2011.6093776","url":null,"abstract":"Automatic image annotation assigns semantic labels to images and thus has great potential to enable semantic-aware image retrieval. However, existing annotation algorithms do not scale to this emerging need, both in terms of computational efficiency and the number of tags they can deal with. Facilitated by the recent development of large-scale image category recognition data such as ImageNet, we extrapolate from it a model for scalable image annotation and semantic-aware image retrieval, namely the ObjectBook. An element of the ObjectBook, called an ObjectWord, is defined as a collection of discriminative image patches annotated with the corresponding objects. We take the ObjectBook as a high-level, semantic-preserving visual vocabulary, and are hence able to easily develop efficient image annotation and inverted-file indexing strategies for large-scale image collections. The proposed retrieval strategy is compared with state-of-the-art algorithms. Experimental results show that the ObjectBook is both discriminative and scalable for large-scale semantic-aware image retrieval.","PeriodicalId":214459,"journal":{"name":"2011 IEEE 13th International Workshop on Multimedia Signal Processing","volume":"54 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133434519","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Transform-domain temporal prediction in video coding with spatially adaptive spectral correlations","authors":"Jingning Han, Vinay Melkote, K. Rose","doi":"10.1109/MMSP.2011.6093815","DOIUrl":"https://doi.org/10.1109/MMSP.2011.6093815","url":null,"abstract":"Temporal prediction in standard video coding is performed in the spatial domain, where each pixel block is predicted from a motion-compensated pixel block in a previously reconstructed frame. Such prediction treats each pixel independently and ignores underlying spatial correlations. In contrast, this paper proposes a paradigm for motion-compensated prediction in the transform domain that eliminates much of the spatial correlation before individual frequency components along a motion trajectory are independently predicted. The proposed scheme exploits the true temporal correlations, which emerge only after signal decomposition and vary considerably from low to high frequency. The scheme adapts spatially and temporally to the evolving source statistics via a recursive procedure that obtains the cross-correlation between transform coefficients on the same motion trajectory. This recursion involves only already reconstructed data and precludes the need for any additional side information in the bit-stream. Experiments demonstrate substantial performance gains in comparison with a standard codec that employs conventional pixel-domain motion-compensated prediction.","PeriodicalId":214459,"journal":{"name":"2011 IEEE 13th International Workshop on Multimedia Signal Processing","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125773011","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Super-resolution reconstruction with prior manifold on primitive patches for video compression","authors":"Jingtao Chen, H. Xiong","doi":"10.1109/MMSP.2011.6093849","DOIUrl":"https://doi.org/10.1109/MMSP.2011.6093849","url":null,"abstract":"This paper proposes a generic video compression framework that combines low-quality video data with a learning-based approach rooted in sparse representation for the ill-posed problem of video super-resolution reconstruction. The reconstruction is regularized by a prior manifold only on the “primitive patches”, and each primitive patch is modeled by a sparse representation over an over-complete dictionary learned from a training set. Due to the low intrinsic dimensionality of primitives, the number of samples in the dictionary can be greatly reduced. Considering the similar geometry of the manifolds of the feature spaces of the low-frequency and high-frequency primitives, we hypothesize that a low-frequency primitive patch and its corresponding high-frequency patch share the same sparse representation structure. In this sense, high-resolution frame primitives are divided into low-frequency and high-frequency frame primitives, and high-frequency primitive patches can be synthesized from the high-frequency primitive patch dictionary and the sparse structure of the corresponding low-frequency primitive patches. The framework involves no explicit motion estimation or auxiliary information, and decomposes the original video sequence into key frames and low-resolution frames with low entropy. The corresponding high-resolution frames are reconstructed by combining the high-frequency and low-frequency patches with smoothness constraints and a back-projection process. Experimental results demonstrate the objective and subjective efficiency in comparison with H.264/AVC and existing super-resolution reconstruction approaches.","PeriodicalId":214459,"journal":{"name":"2011 IEEE 13th International Workshop on Multimedia Signal Processing","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132801335","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Adaptive in-loop noise-filtered prediction for High Efficiency Video Coding","authors":"Eugen Wige, Gilbert Yammine, P. Amon, A. Hutter, André Kaup","doi":"10.1109/MMSP.2011.6093773","DOIUrl":"https://doi.org/10.1109/MMSP.2011.6093773","url":null,"abstract":"Compression of noisy image sequences is a hard challenge in video coding. Especially for high-quality compression, preprocessing of the videos is not possible, as it decreases their objective quality. To overcome this problem, this paper presents an in-loop denoising framework for efficient medium- to high-fidelity compression of noisy video data. It is shown that low-complexity in-loop noise estimation and noise filtering, as well as adaptive selection of the denoised inter-frame predictors, can improve compression performance. The proposed algorithm for adaptive selection of the denoised predictor is based on the current HEVC reference model. The different inter-frame prediction modes within the current HEVC reference model are exploited for adaptive selection of denoised prediction, by transmitting some side information in combination with decoder-side estimation of the denoised prediction. The simulation results show considerable gains using the proposed in-loop denoising framework with adaptive selection. In addition, the simulation results show the theoretical bounds on compression efficiency if the adaptive selection of the denoised prediction could be perfectly estimated at the decoder.","PeriodicalId":214459,"journal":{"name":"2011 IEEE 13th International Workshop on Multimedia Signal Processing","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134161360","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}