{"title":"Linear time offline tracking and lower envelope algorithms","authors":"Steve Gu, Ying Zheng, Carlo Tomasi","doi":"10.1109/ICCV.2011.6126451","DOIUrl":"https://doi.org/10.1109/ICCV.2011.6126451","url":null,"abstract":"Offline tracking of visual objects is particularly helpful in the presence of significant occlusions, when a frame-by-frame, causal tracker is likely to lose sight of the target. In addition, the trajectories found by offline tracking are typically smoother and more stable because of the global optimization this approach entails. In contrast with previous work, we show that this global optimization can be performed in O(MNT) time for T frames of video at M × N resolution, with the help of the generalized distance transform developed by Felzenszwalb and Huttenlocher [13]. Recognizing the importance of this distance transform, we extend the computation to a more general lower envelope algorithm in certain heterogeneous l1-distance metric spaces. The generalized lower envelope algorithm is of complexity O(MN(M+N)) and is useful for a more challenging offline tracking problem. Experiments show that trajectories found by offline tracking are superior to those computed by online tracking methods, and are computed at 100 frames per second.","PeriodicalId":6391,"journal":{"name":"2011 International Conference on Computer Vision","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2011-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79465099","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Extracting adaptive contextual cues from unlabeled regions","authors":"Congcong Li, Devi Parikh, Tsuhan Chen","doi":"10.1109/ICCV.2011.6126282","DOIUrl":"https://doi.org/10.1109/ICCV.2011.6126282","url":null,"abstract":"Existing approaches to contextual reasoning for enhanced object detection typically utilize other labeled categories in the images to provide contextual information. As a consequence, they inadvertently commit to the granularity of information implicit in the labels. Moreover, large portions of the images may not belong to any of the manually-chosen categories, and these unlabeled regions are typically neglected. In this paper, we overcome both these drawbacks and propose a contextual cue that exploits unlabeled regions in images. Our approach adaptively determines the granularity (scene, inter-object, intra-object, etc.) at which contextual information is captured. In order to extract the proposed contextual cue, we consider a scene to be a structured configuration of objects and regions; just as an object is a composition of parts. We thus learn our proposed “contextual meta-objects” using any off-the-shelf object detector, which makes our proposed cue widely accessible to the community. Our results show that incorporating our proposed cue provides a relative improvement of 12% over a state-of-the-art object detector on the challenging PASCAL dataset.","PeriodicalId":6391,"journal":{"name":"2011 International Conference on Computer Vision","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2011-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83168528","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-view repetitive structure detection","authors":"Nianjuan Jiang, P. Tan, L. Cheong","doi":"10.1109/ICCV.2011.6126285","DOIUrl":"https://doi.org/10.1109/ICCV.2011.6126285","url":null,"abstract":"Symmetry, especially repetitive structures in architecture, is universally demonstrated across countries and cultures. Existing detection methods mainly focus on the detection of planar patterns from a single image. It is difficult to apply them to detect repetitive structures in architecture, which abounds with non-planar 3D repetitive elements (such as balconies and windows) and curved surfaces. We study the repetitive structure detection problem from multiple images of such architecture. Our method jointly analyzes these images and a set of 3D points reconstructed from them by structure-from-motion algorithms. 3D points help to rectify geometric deformations and hypothesize possible lattice structures, while images provide denser color and texture information to evaluate and confirm these hypotheses. In the experiments, we compare our method with an existing algorithm. We also show how our results might be used to assist image-based modeling.","PeriodicalId":6391,"journal":{"name":"2011 International Conference on Computer Vision","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2011-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87280288","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Panoramic stereo video textures","authors":"V. Chapdelaine-Couture, M. Langer, S. Roy","doi":"10.1109/ICCV.2011.6126376","DOIUrl":"https://doi.org/10.1109/ICCV.2011.6126376","url":null,"abstract":"A panoramic stereo (or omnistereo) pair of images provides depth information from stereo up to 360 degrees around a central observer. Because omnistereo lenses or mirrors do not yet exist, synthesizing omnistereo images requires multiple stereo camera positions and baseline orientations. Recent omnistereo methods stitch together many small field of view images called slits which are captured by one or two cameras following a circular motion. However, these methods produce omnistereo images for static scenes only. The situation is much more challenging for dynamic scenes since stitching needs to occur over both space and time and should synchronize the motion between left and right views as much as possible. This paper presents the first ever method for synthesizing panoramic stereo video textures. The method uses full frames rather than slits and uses blending across seams rather than smoothing or matching based on graph cuts. The method produces loopable panoramic stereo videos that can be displayed up to 360 degrees around a viewer.","PeriodicalId":6391,"journal":{"name":"2011 International Conference on Computer Vision","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2011-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91233935","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Sorted Random Projections for robust texture classification","authors":"Li Liu, P. Fieguth, Gangyao Kuang, H. Zha","doi":"10.1109/ICCV.2011.6126267","DOIUrl":"https://doi.org/10.1109/ICCV.2011.6126267","url":null,"abstract":"This paper presents a simple and highly effective system for robust texture classification, based on (1) random local features, (2) a simple global Bag-of-Words (BoW) representation, and (3) Support Vector Machines (SVMs) based classification. The key contribution in this work is to apply a sorting strategy to a universal yet information-preserving random projection (RP) technique, and then to compare two different texture image representations (histograms and signatures) with various kernels in the SVMs. We have tested our texture classification system on six popular and challenging texture databases for exemplar based texture classification, comparing with 12 recent state-of-the-art methods. Experimental results show that our texture classification system yields the best classification rates of which we are aware: 99.37% for CUReT, 97.16% for Brodatz, 99.30% for UMD and 99.29% for KTH-TIPS. Moreover, combining random features significantly outperforms the state-of-the-art descriptors in material categorization.","PeriodicalId":6391,"journal":{"name":"2011 International Conference on Computer Vision","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2011-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91302123","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Center-surround divergence of feature statistics for salient object detection","authors":"D. A. Klein, S. Frintrop","doi":"10.1109/ICCV.2011.6126499","DOIUrl":"https://doi.org/10.1109/ICCV.2011.6126499","url":null,"abstract":"In this paper, we introduce a new method to detect salient objects in images. The approach is based on the standard structure of cognitive visual attention models, but realizes the computation of saliency in each feature dimension in an information-theoretic way. The method allows a consistent computation of all feature channels and a well-founded fusion of these channels to a saliency map. Our framework enables the computation of arbitrarily scaled features and local center-surround pairs in an efficient manner. We show that our approach outperforms eight state-of-the-art saliency detectors in terms of precision and recall.","PeriodicalId":6391,"journal":{"name":"2011 International Conference on Computer Vision","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2011-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89686725","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"iGroup: Weakly supervised image and video grouping","authors":"Andrew Gilbert, R. Bowden","doi":"10.1109/ICCV.2011.6126493","DOIUrl":"https://doi.org/10.1109/ICCV.2011.6126493","url":null,"abstract":"We present a generic, efficient and iterative algorithm for interactively clustering classes of images and videos. The approach moves away from the use of large hand labelled training datasets, instead allowing the user to find natural groups of similar content based upon a handful of “seed” examples. Two efficient data mining tools originally developed for text analysis, min-Hash and APriori, are used and extended to achieve both speed and scalability on large image and video datasets. Inspired by the Bag-of-Words (BoW) architecture, the idea of an image signature is introduced as a simple descriptor on which nearest neighbour classification can be performed. The image signature is then dynamically expanded to identify common features amongst samples of the same class. The iterative approach uses APriori to identify common and distinctive elements of a small set of labelled true and false positive signatures. These elements are then accentuated in the signature to increase similarity between examples and “pull” positive classes together. By repeating this process, the accuracy of similarity increases dramatically despite only a few training examples; only 10% of the labelled groundtruth is needed, compared to other approaches. It is tested on two image datasets including the Caltech101 [9] dataset and on three state-of-the-art action recognition datasets. On the YouTube [18] video dataset the accuracy increases from 72% to 97% using only 44 labelled examples from a dataset of over 1200 videos. The approach is both scalable and efficient, with an iteration on the full YouTube dataset taking around 1 minute on a standard desktop machine.","PeriodicalId":6391,"journal":{"name":"2011 International Conference on Computer Vision","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2011-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90265761","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Discriminative learning of relaxed hierarchy for large-scale visual recognition","authors":"Tianshi Gao, D. Koller","doi":"10.1109/ICCV.2011.6126481","DOIUrl":"https://doi.org/10.1109/ICCV.2011.6126481","url":null,"abstract":"In the real visual world, the number of categories a classifier needs to discriminate is on the order of hundreds or thousands. For example, the SUN dataset [24] contains 899 scene categories and ImageNet [6] has 15,589 synsets. Designing a multiclass classifier that is both accurate and fast at test time is an extremely important problem in both machine learning and computer vision communities. To achieve a good trade-off between accuracy and speed, we adopt the relaxed hierarchy structure from [15], where a set of binary classifiers are organized in a tree or DAG (directed acyclic graph) structure. At each node, classes are colored into positive and negative groups which are separated by a binary classifier while a subset of confusing classes is ignored. We color the classes and learn the induced binary classifier simultaneously using a unified and principled max-margin optimization. We provide an analysis on generalization error to justify our design. Our method has been tested on both Caltech-256 (object recognition) [9] and the SUN dataset (scene classification) [24], and shows significant improvement over existing methods.","PeriodicalId":6391,"journal":{"name":"2011 International Conference on Computer Vision","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2011-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73044785","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Image representation by active curves","authors":"Wenze Hu, Y. Wu, Song-Chun Zhu","doi":"10.1109/ICCV.2011.6126447","DOIUrl":"https://doi.org/10.1109/ICCV.2011.6126447","url":null,"abstract":"This paper proposes a sparse image representation using deformable templates of simple geometric structures that are commonly observed in images of natural scenes. These deformable templates include active curve templates and active corner templates. An active curve template is a composition of Gabor wavelet elements placed with equal spacing on a straight line segment or a circular arc segment of constant curvature, where each Gabor wavelet element is allowed to locally shift its location and orientation, so that the original line and arc segment of the active curve template can be deformed to fit the observed image. An active corner or angle template is a composition of two active curve templates that share a common end point, and the active curve templates are allowed to vary their overall lengths and curvatures, so that the original corner template can deform to match the observed image. This paper then proposes a hierarchical computational architecture of summax maps that pursues a sparse representation of an image by selecting a small number of active curve and corner templates from a dictionary of all such templates. Experiments show that the proposed method is capable of finding sparse representations of natural images. It is also shown that object templates can be learned by selecting and composing active curve and corner templates.","PeriodicalId":6391,"journal":{"name":"2011 International Conference on Computer Vision","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2011-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73575825","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A linear subspace learning approach via sparse coding","authors":"Lei Zhang, Peng Fei Zhu, Q. Hu, D. Zhang","doi":"10.1109/ICCV.2011.6126313","DOIUrl":"https://doi.org/10.1109/ICCV.2011.6126313","url":null,"abstract":"Linear subspace learning (LSL) is a popular approach to image recognition and it aims to reveal the essential features of high dimensional data, e.g., facial images, in a lower dimensional space by linear projection. Most LSL methods compute directly the statistics of original training samples to learn the subspace. However, these methods do not effectively exploit the different contributions of different image components to image recognition. We propose a novel LSL approach by sparse coding and feature grouping. A dictionary is learned from the training dataset, and it is used to sparsely decompose the training samples. The decomposed image components are grouped into a more discriminative part (MDP) and a less discriminative part (LDP). An unsupervised criterion and a supervised criterion are then proposed to learn the desired subspace, where the MDP is preserved and the LDP is suppressed simultaneously. The experimental results on benchmark face image databases validated that the proposed methods outperform many state-of-the-art LSL schemes.","PeriodicalId":6391,"journal":{"name":"2011 International Conference on Computer Vision","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2011-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73745128","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}