{"title":"Recognition of Multiple-Food Images by Detecting Candidate Regions","authors":"Yuji Matsuda, H. Hoashi, Keiji Yanai","doi":"10.1109/ICME.2012.157","DOIUrl":"https://doi.org/10.1109/ICME.2012.157","url":null,"abstract":"In this paper, we propose a two-step method to recognize multiple-food images by detecting candidate regions with several methods and classifying them with various kinds of features. In the first step, we detect candidate regions by fusing the outputs of several region detectors, including Felzenszwalb's deformable part model (DPM) [1], a circle detector, and JSEG region segmentation. In the second step, we apply a feature-fusion-based food recognition method to the bounding boxes of the candidate regions, using various kinds of visual features including bag-of-features of SIFT and CSIFT with spatial pyramid (SP-BoF), histogram of oriented gradient (HoG), and Gabor texture features. In the experiments, we estimated the top ten food candidates for each multiple-food image in descending order of confidence score. We achieved a 55.8% classification rate on a multiple-food image data set, improving the DPM-only baseline by 14.3 points. This demonstrates that the proposed two-step method is effective for recognition of multiple-food images.","PeriodicalId":273567,"journal":{"name":"2012 IEEE International Conference on Multimedia and Expo","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122350469","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Video Copy Detection Using a Soft Cascade of Multimodal Features","authors":"Menglin Jiang, Yonghong Tian, Tiejun Huang","doi":"10.1109/ICME.2012.189","DOIUrl":"https://doi.org/10.1109/ICME.2012.189","url":null,"abstract":"In the video copy detection task, it is widely recognized that no single feature works well for all transformations. Thus more and more approaches adopt a set of complementary features to cope with complex audio-visual transformations. However, most of them utilize individual features separately and obtain the final result by fusing the outputs of several basic detectors, which often leads to low detection efficiency. Moreover, several thresholds and parameters must be carefully tuned. To address these problems, we propose a soft cascade approach that integrates multiple features for efficient copy detection. In our approach, basic detectors are organized in a cascaded framework, which processes a query video in sequence until one detector asserts it as a copy. To fully exploit the complementarity of these detectors, a learning algorithm is proposed to estimate the optimal decision thresholds in the cascade architecture. Excellent performance on the benchmark dataset of the TRECVid 2011 CBCD task demonstrates the effectiveness and efficiency of our approach.","PeriodicalId":273567,"journal":{"name":"2012 IEEE International Conference on Multimedia and Expo","volume":"263 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114264755","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SIFT-Based Image Compression","authors":"Huanjing Yue, Xiaoyan Sun, Feng Wu, Jingyu Yang","doi":"10.1109/ICME.2012.52","DOIUrl":"https://doi.org/10.1109/ICME.2012.52","url":null,"abstract":"This paper proposes a novel image compression scheme based on the local feature descriptor Scale Invariant Feature Transform (SIFT). The SIFT descriptor characterizes an image region invariantly to scale and rotation, and is widely used in image retrieval. By using SIFT descriptors, our compression scheme is able to make use of external image content to reduce visual redundancy among images. The proposed encoder compresses an input image by SIFT descriptors rather than pixel values. To reduce the coding bits, it separates the SIFT descriptors of the image into two groups: a visual description, which is a significantly subsampled image with key SIFT descriptors embedded, and a set of differential SIFT descriptors. The corresponding decoder regenerates the SIFT descriptors from the visual description and the differential set. The SIFT descriptors are then used in our SIFT-based matching to retrieve candidate predictive patches from a large image dataset. These candidate patches are integrated into the visual description, producing the final reconstructed images. Our preliminary but promising results demonstrate the effectiveness of the proposed image coding scheme in terms of perceptual quality. The proposed image compression scheme thus provides a feasible approach to exploiting the visual correlation among images.","PeriodicalId":273567,"journal":{"name":"2012 IEEE International Conference on Multimedia and Expo","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115310106","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Robust Face Super-Resolution Using Free-Form Deformations for Low-Quality Surveillance Video","authors":"Tomonari Yoshida, Tomokazu Takahashi, Daisuke Deguchi, I. Ide, H. Murase","doi":"10.1109/ICME.2012.162","DOIUrl":"https://doi.org/10.1109/ICME.2012.162","url":null,"abstract":"Recently, the demand for face recognition to identify persons from surveillance video cameras has rapidly increased. Since surveillance cameras are usually placed far from a person's face, the quality of the captured face images tends to be low, which degrades recognition accuracy. Therefore, aiming to improve the accuracy of low-resolution face recognition, we propose a video-based super-resolution method. The proposed method can generate a high-resolution face image from low-resolution video frames containing non-rigid deformations caused by changes in face pose and expression, without using any positional information of facial feature points. Most existing techniques use facial feature points for image alignment between video frames; however, it is difficult to obtain accurate positions of these feature points from low-resolution face images. To achieve the alignment, the proposed method uses a free-form deformation method that flexibly aligns each local region between the images, enabling super-resolution of face images from low-resolution videos. Experimental results demonstrated that the proposed method improved the performance of super-resolution for actual videos in terms of both image quality and face recognition accuracy.","PeriodicalId":273567,"journal":{"name":"2012 IEEE International Conference on Multimedia and Expo","volume":"147 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123781335","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Bringing Videos to Social Media","authors":"S. Kopf, Stefan Wilk, W. Effelsberg","doi":"10.1109/ICME.2012.86","DOIUrl":"https://doi.org/10.1109/ICME.2012.86","url":null,"abstract":"Although the importance of video sharing and social media is increasing day by day, a full integration of videos into social media has not yet been achieved. We have developed a system that maps the concept of hypervideo, which allows users to annotate objects in a video, to social media. We define this combination as social video, which allows a large number of users to contribute simultaneously to the content of a video. Users can annotate video objects by adding images, text, other videos, Web links, or even communication topics. An integrated chat system allows users to communicate with friends and to link these topics to distinct objects in the video. We analyze the technical functionality and the user acceptance of our social video system in detail. Due to the integration into the social network Facebook, more than 12,000 users have already accessed our system.","PeriodicalId":273567,"journal":{"name":"2012 IEEE International Conference on Multimedia and Expo","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123203186","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Real-Time Storyboard Generation for H.264/AVC Compressed Videos","authors":"Pei Dong, Yong Xia, D. Feng","doi":"10.1109/ICME.2012.49","DOIUrl":"https://doi.org/10.1109/ICME.2012.49","url":null,"abstract":"Video summarization enables convenient and efficient management of large volumes of visual data. However, most existing summarization approaches are based on either pixel-domain information or conventional video compression standards. As the most recent and popular international video coding standard, H.264/AVC adopts a number of advanced techniques and brings not only opportunities but also challenges to video summarization. In this paper, we propose a real-time image storyboard generation algorithm for H.264/AVC compressed videos that uses both compressed-domain and pixel-domain information jointly and adaptively. The algorithm extracts compressed-domain information for visual content representation, video structuring, and candidate representative frame selection. By fusing compressed-domain and pixel-domain information, the redundancy among the candidate representative frames is further reduced. Our experimental results show that the proposed algorithm can efficiently produce image storyboards conforming to human interpretation of the essential content in generic videos.","PeriodicalId":273567,"journal":{"name":"2012 IEEE International Conference on Multimedia and Expo","volume":"138 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123270951","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploiting Structured Sparsity for Image Deblurring","authors":"Haichao Zhang, Yanning Zhang, Thomas S. Huang","doi":"10.1109/ICME.2012.110","DOIUrl":"https://doi.org/10.1109/ICME.2012.110","url":null,"abstract":"Sparsity is a ubiquitous property of many natural real-world data such as images, and it has been playing an important role in image and multimedia data processing. However, for many data, such as images, the sparsity pattern is not completely random, i.e., there are structures over the sparse coefficients. By exploiting this structure, we can model the data better and may further improve the performance of the recovery algorithm. In this paper, we exploit the structured sparsity of natural images for the image deblurring task. Experimental results clearly demonstrate the effectiveness of the proposed approach.","PeriodicalId":273567,"journal":{"name":"2012 IEEE International Conference on Multimedia and Expo","volume":"755 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123278255","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Position-Patch Based Face Hallucination via Locality-Constrained Representation","authors":"Junjun Jiang, R. Hu, Zhen Han, T. Lu, Kebin Huang","doi":"10.1109/ICME.2012.152","DOIUrl":"https://doi.org/10.1109/ICME.2012.152","url":null,"abstract":"Instead of using probabilistic graph based or manifold learning based models, some approaches based on position-patch have been proposed for face hallucination recently. In order to obtain the optimal weights for face hallucination, they represent image patches through the patches at the same position of training face images by employing least square estimation or convex optimization. However, they can neither provide unbiased solutions nor satisfy locality conditions, so the obtained patch representation is not optimal. In this paper, a simpler but more effective representation scheme, Locality-constrained Representation (LcR), is developed and compared with the Least Square Representation (LSR) and Sparse Representation (SR). It imposes a locality constraint onto the least square inversion problem to achieve sparsity and locality simultaneously. Experimental results demonstrate the superiority of the proposed method over several state-of-the-art face hallucination approaches.","PeriodicalId":273567,"journal":{"name":"2012 IEEE International Conference on Multimedia and Expo","volume":"2006 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123802909","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Real-Time Hand Pose Estimation from RGB-D Sensor","authors":"Y. Yao, Y. Fu","doi":"10.1109/ICME.2012.48","DOIUrl":"https://doi.org/10.1109/ICME.2012.48","url":null,"abstract":"Hand pose estimation in cluttered environments is always challenging. In this paper, we address the problem of hand pose estimation from an RGB-D sensor. To achieve robust real-time usability, we first design a data acquisition strategy, using a color glove to label different hand parts, and collect a new training data set. Then a novel hand pose estimation framework is presented, in which feature fusion drives hand localization and hand part classification. Moreover, instead of an articulated model, a simplified and efficient 3D contour model is designed to support real-time implementation, which does not require a large amount of training data. Experiments show that our approach can handle real-time hand interaction in a desktop environment with cluttered background.","PeriodicalId":273567,"journal":{"name":"2012 IEEE International Conference on Multimedia and Expo","volume":"72 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131608872","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Discovering Social Photo Navigation Patterns","authors":"Luca Chiarandini, Michele Trevisiol, A. Jaimes","doi":"10.1109/ICME.2012.96","DOIUrl":"https://doi.org/10.1109/ICME.2012.96","url":null,"abstract":"In general, user browsing behavior has been examined within specific tasks (e.g., search) or in the context of particular web sites or services (e.g., shopping sites). However, with the growth of social networks and the proliferation of many different types of web services (e.g., news aggregators, blogs, forums, etc.), the web can be viewed as an ecosystem in which a user's actions in a particular web service may be influenced by the service she arrived from (e.g., are users' browsing patterns similar if they arrive at a website via search or via links in aggregators?). In particular, since photos in services like Flickr are used extensively throughout the web, it is common for visitors to arrive at the site via links in many different types of web sites. In this paper, we start from the hypothesis that visitors to social sites such as Flickr behave differently depending on where they come from. For this purpose, we analyze a large sample of Flickr user logs to discover social photo navigation patterns. More specifically, we classify pages within Flickr into different categories (e.g., \"add a friend page\", \"single photo page\", etc.), and by clustering sessions we discover important differences in social photo navigation that depend on the type of site users visit before visiting Flickr. Our work is the first to examine photo navigation patterns in Flickr while taking the referrer domain into account. Our analysis contributes to a better understanding of how people use photo services like Flickr, and it can inform the design of user modeling and recommendation algorithms, among others.","PeriodicalId":273567,"journal":{"name":"2012 IEEE International Conference on Multimedia and Expo","volume":"71 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131890336","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}