{"title":"Multiclass recognition and part localization with humans in the loop","authors":"C. Wah, Steve Branson, P. Perona, Serge J. Belongie","doi":"10.1109/ICCV.2011.6126539","DOIUrl":"https://doi.org/10.1109/ICCV.2011.6126539","url":null,"abstract":"We propose a visual recognition system that is designed for fine-grained visual categorization. The system is composed of a machine and a human user. The user, who is unable to carry out the recognition task by himself, is interactively asked to provide two heterogeneous forms of information: clicking on object parts and answering binary questions. The machine intelligently selects the most informative question to pose to the user in order to identify the object's class as quickly as possible. By leveraging computer vision and analyzing the user responses, the overall amount of human effort required, measured in seconds, is minimized. We demonstrate promising results on a challenging dataset of uncropped images, achieving a significant average reduction in human effort over previous methods.","PeriodicalId":6391,"journal":{"name":"2011 International Conference on Computer Vision","volume":"758 1","pages":"2524-2531"},"PeriodicalIF":0.0,"publicationDate":"2011-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76897216","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Discovering favorite views of popular places with iconoid shift","authors":"Tobias Weyand, B. Leibe","doi":"10.1109/ICCV.2011.6126361","DOIUrl":"https://doi.org/10.1109/ICCV.2011.6126361","url":null,"abstract":"In this paper, we propose a novel algorithm for automatic landmark building discovery in large, unstructured image collections. In contrast to other approaches which aim at a hard clustering, we regard the task as a mode estimation problem. Our algorithm searches for local attractors in the image distribution that have a maximal mutual homography overlap with the images in their neighborhood. Those attractors correspond to central, iconic views of single objects or buildings, which we efficiently extract using a medoid shift search with a novel distance measure. We propose efficient algorithms for performing this search. Most importantly, our approach performs only an efficient local exploration of the matching graph that makes it applicable for large-scale analysis of photo collections. We show experimental results validating our approach on a dataset of 500k images of the inner city of Paris.","PeriodicalId":6391,"journal":{"name":"2011 International Conference on Computer Vision","volume":"28 1","pages":"1132-1139"},"PeriodicalIF":0.0,"publicationDate":"2011-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80960585","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multiscale, curvature-based shape representation for surfaces","authors":"Ruirui Jiang, X. Gu","doi":"10.1109/ICCV.2011.6126457","DOIUrl":"https://doi.org/10.1109/ICCV.2011.6126457","url":null,"abstract":"This paper presents a multiscale, curvature-based shape representation technique for general genus zero closed surfaces. The method is invariant under rotation, translation, scaling, or general isometric deformations; it is robust to noise and preserves intrinsic symmetry. The method is a direct generalization of the Curvature Scale Space (CSS) shape descriptor for planar curves. In our method, the Riemannian metric of the surface is deformed under Ricci flow, such that the Gaussian curvature evolves according to a heat diffusion process. Eventually the surface becomes a sphere with constant positive curvature everywhere. The evolution of zero curvature curves on the surface is utilized as the shape descriptor. Our experimental results on a 3D geometric database with about 80 shapes demonstrate the efficiency and efficacy of the method.","PeriodicalId":6391,"journal":{"name":"2011 International Conference on Computer Vision","volume":"12 1","pages":"1887-1894"},"PeriodicalIF":0.0,"publicationDate":"2011-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80421542","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HMDB: A large video database for human motion recognition","authors":"Hilde Kuehne, Hueihan Jhuang, Estíbaliz Garrote, T. Poggio, Thomas Serre","doi":"10.1109/ICCV.2011.6126543","DOIUrl":"https://doi.org/10.1109/ICCV.2011.6126543","url":null,"abstract":"With nearly one billion online videos viewed everyday, an emerging new frontier in computer vision research is recognition and search in video. While much effort has been devoted to the collection and annotation of large scalable static image datasets containing thousands of image categories, human action datasets lag far behind. Current action recognition databases contain on the order of ten different action categories collected under fairly controlled conditions. State-of-the-art performance on these datasets is now near ceiling and thus there is a need for the design and creation of new benchmarks. To address this issue we collected the largest action video database to-date with 51 action categories, which in total contain around 7,000 manually annotated clips extracted from a variety of sources ranging from digitized movies to YouTube. We use this database to evaluate the performance of two representative computer vision systems for action recognition and explore the robustness of these methods under various conditions such as camera motion, viewpoint, video quality and occlusion.","PeriodicalId":6391,"journal":{"name":"2011 International Conference on Computer Vision","volume":"12 1","pages":"2556-2563"},"PeriodicalIF":0.0,"publicationDate":"2011-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78661585","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Spatiotemporal oriented energies for spacetime stereo","authors":"Mikhail Sizintsev, Richard P. Wildes","doi":"10.1109/ICCV.2011.6126362","DOIUrl":"https://doi.org/10.1109/ICCV.2011.6126362","url":null,"abstract":"This paper presents a novel approach to recovering temporally coherent estimates of 3D structure of a dynamic scene from a sequence of binocular stereo images. The approach is based on matching spatiotemporal orientation distributions between left and right temporal image streams, which encapsulates both local spatial and temporal structure for disparity estimation. By capturing spatial and temporal structure in this unified fashion, both sources of information combine to yield disparity estimates that are naturally temporal coherent, while helping to resolve matches that might be ambiguous when either source is considered alone. Further, by allowing subsets of the orientation measurements to support different disparity estimates, an approach to recovering multilayer disparity from spacetime stereo is realized. The approach has been implemented with real-time performance on commodity GPUs. Empirical evaluation shows that the approach yields qualitatively and quantitatively superior disparity estimates in comparison to various alternative approaches, including the ability to provide accurate multilayer estimates in the presence of (semi)transparent and specular surfaces.","PeriodicalId":6391,"journal":{"name":"2011 International Conference on Computer Vision","volume":"50 1","pages":"1140-1147"},"PeriodicalIF":0.0,"publicationDate":"2011-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76205513","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"From contours to 3D object detection and pose estimation","authors":"Nadia Payet, S. Todorovic","doi":"10.1109/ICCV.2011.6126342","DOIUrl":"https://doi.org/10.1109/ICCV.2011.6126342","url":null,"abstract":"This paper addresses view-invariant object detection and pose estimation from a single image. While recent work focuses on object-centered representations of point-based object features, we revisit the viewer-centered framework, and use image contours as basic features. Given training examples of arbitrary views of an object, we learn a sparse object model in terms of a few view-dependent shape templates. The shape templates are jointly used for detecting object occurrences and estimating their 3D poses in a new image. Instrumental to this is our new mid-level feature, called bag of boundaries (BOB), aimed at lifting from individual edges toward their more informative summaries for identifying object boundaries amidst the background clutter. In inference, BOBs are placed on deformable grids both in the image and the shape templates, and then matched. This is formulated as a convex optimization problem that accommodates invariance to non-rigid, locally affine shape deformations. Evaluation on benchmark datasets demonstrates our competitive results relative to the state of the art.","PeriodicalId":6391,"journal":{"name":"2011 International Conference on Computer Vision","volume":"56 1","pages":"983-990"},"PeriodicalIF":0.0,"publicationDate":"2011-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88965978","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Understanding egocentric activities","authors":"A. Fathi, Ali Farhadi, James M. Rehg","doi":"10.1109/ICCV.2011.6126269","DOIUrl":"https://doi.org/10.1109/ICCV.2011.6126269","url":null,"abstract":"We present a method to analyze daily activities, such as meal preparation, using video from an egocentric camera. Our method performs inference about activities, actions, hands, and objects. Daily activities are a challenging domain for activity recognition which are well-suited to an egocentric approach. In contrast to previous activity recognition methods, our approach does not require pre-trained detectors for objects and hands. Instead we demonstrate the ability to learn a hierarchical model of an activity by exploiting the consistent appearance of objects, hands, and actions that results from the egocentric context. We show that joint modeling of activities, actions, and objects leads to superior performance in comparison to the case where they are considered independently. We introduce a novel representation of actions based on object-hand interactions and experimentally demonstrate the superior performance of our representation in comparison to standard activity representations such as bag of words.","PeriodicalId":6391,"journal":{"name":"2011 International Conference on Computer Vision","volume":"03 1","pages":"407-414"},"PeriodicalIF":0.0,"publicationDate":"2011-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88903962","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Weakly supervised object detector learning with model drift detection","authors":"P. Siva, T. Xiang","doi":"10.1109/ICCV.2011.6126261","DOIUrl":"https://doi.org/10.1109/ICCV.2011.6126261","url":null,"abstract":"A conventional approach to learning object detectors uses fully supervised learning techniques which assumes that a training image set with manual annotation of object bounding boxes are provided. The manual annotation of objects in large image sets is tedious and unreliable. Therefore, a weakly supervised learning approach is desirable, where the training set needs only binary labels regarding whether an image contains the target object class. In the weakly supervised approach a detector is used to iteratively annotate the training set and learn the object model. We present a novel weakly supervised learning framework for learning an object detector. Our framework incorporates a new initial annotation model to start the iterative learning of a detector and a model drift detection method that is able to detect and stop the iterative learning when the detector starts to drift away from the objects of interest. We demonstrate the effectiveness of our approach on the challenging PASCAL 2007 dataset.","PeriodicalId":6391,"journal":{"name":"2011 International Conference on Computer Vision","volume":"40 1","pages":"343-350"},"PeriodicalIF":0.0,"publicationDate":"2011-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83153172","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Latent Low-Rank Representation for subspace segmentation and feature extraction","authors":"Guangcan Liu, Shuicheng Yan","doi":"10.1109/ICCV.2011.6126422","DOIUrl":"https://doi.org/10.1109/ICCV.2011.6126422","url":null,"abstract":"Low-Rank Representation (LRR) [16, 17] is an effective method for exploring the multiple subspace structures of data. Usually, the observed data matrix itself is chosen as the dictionary, which is a key aspect of LRR. However, such a strategy may depress the performance, especially when the observations are insufficient and/or grossly corrupted. In this paper we therefore propose to construct the dictionary by using both observed and unobserved, hidden data. We show that the effects of the hidden data can be approximately recovered by solving a nuclear norm minimization problem, which is convex and can be solved efficiently. The formulation of the proposed method, called Latent Low-Rank Representation (LatLRR), seamlessly integrates subspace segmentation and feature extraction into a unified framework, and thus provides us with a solution for both subspace segmentation and feature extraction. As a subspace segmentation algorithm, LatLRR is an enhanced version of LRR and outperforms the state-of-the-art algorithms. Being an unsupervised feature extraction algorithm, LatLRR is able to robustly extract salient features from corrupted data, and thus can work much better than the benchmark that utilizes the original data vectors as features for classification. Compared to dimension reduction based methods, LatLRR is more robust to noise.","PeriodicalId":6391,"journal":{"name":"2011 International Conference on Computer Vision","volume":"75 1","pages":"1615-1622"},"PeriodicalIF":0.0,"publicationDate":"2011-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79185504","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cluster-based color space optimizations","authors":"Cheryl Lau, W. Heidrich, Rafał K. Mantiuk","doi":"10.1109/ICCV.2011.6126366","DOIUrl":"https://doi.org/10.1109/ICCV.2011.6126366","url":null,"abstract":"Transformations between different color spaces and gamuts are ubiquitous operations performed on images. Often, these transformations involve information loss, for example when mapping from color to grayscale for printing, from multispectral or multiprimary data to tristimulus spaces, or from one color gamut to another. In all these applications, there exists a straightforward “natural” mapping from the source space to the target space, but the mapping is not bijective, resulting in information loss due to metamerism and similar effects. We propose a cluster-based approach for optimizing the transformation for individual images in a way that preserves as much of the information as possible from the source space while staying as faithful as possible to the natural mapping. Our approach can be applied to a host of color transformation problems including color to gray, gamut mapping, conversion of multispectral and multiprimary data to tristimulus colors, and image optimization for color deficient viewers.","PeriodicalId":6391,"journal":{"name":"2011 International Conference on Computer Vision","volume":"45 1","pages":"1172-1179"},"PeriodicalIF":0.0,"publicationDate":"2011-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79268262","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}