{"title":"Saliency moments for image categorization","authors":"Miriam Redi, B. Mérialdo","doi":"10.1145/1991996.1992035","DOIUrl":"https://doi.org/10.1145/1991996.1992035","url":null,"abstract":"In this paper we present Saliency Moments, a new, holistic descriptor for image recognition inspired by two biological vision principles: the gist perception and the selective visual attention. While traditional image features extract either local or global discriminative properties from the visual content, we use a hybrid approach that exploits some coarsely localized information, i.e. the salient regions shape and contours, to build a global, low-dimensional image signature. Results show that this new type of image description outperforms the traditional global features on scene and object categorization, for a variety of challenging datasets. Moreover, we show that, when combined with other existing descriptors (SIFT, Color Moments, Wavelet Feature and Edge Histogram), the saliency-based features provide complementary information, improving the precision of a retrieval system we build for the TRECVID 2010.","PeriodicalId":390933,"journal":{"name":"Proceedings of the 1st ACM International Conference on Multimedia Retrieval","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-04-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132826758","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Summarization of personal photologs using multidimensional content and context","authors":"Pinaki Sinha, S. Mehrotra, R. Jain","doi":"10.1145/1991996.1992000","DOIUrl":"https://doi.org/10.1145/1991996.1992000","url":null,"abstract":"In this paper, we propose a framework for generation of representative subset summaries from large personal photo collections. These summaries will help in effective sharing and browsing of the personal photos. We define three salient properties: quality, diversity and coverage that an informative summary should satisfy. We propose methods to compute these properties using multidimensional content and context data. The objective of summarization is modeled as an optimization of these properties, given the size constraints. We also propose metrics which will evaluate the photo summaries based on their representation of the larger corpus and the ability to satisfy user's information needs. We use a dataset of 40K personal photos collected by crawling photo sharing and storage sites of sixteen users. Our experiments show that the summarization algorithm works better than the baseline algorithms.","PeriodicalId":390933,"journal":{"name":"Proceedings of the 1st ACM International Conference on Multimedia Retrieval","volume":"69 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-04-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130004841","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automatic tagging and geotagging in video collections and communities","authors":"M. Larson, M. Soleymani, P. Serdyukov, S. Rudinac, Christian Wartena, Vanessa Murdock, G. Friedland, R. Ordelman, Gareth J.F. Jones","doi":"10.1145/1991996.1992047","DOIUrl":"https://doi.org/10.1145/1991996.1992047","url":null,"abstract":"Automatically generated tags and geotags hold great promise to improve access to video collections and online communities. We overview three tasks offered in the MediaEval 2010 benchmarking initiative, for each, describing its use scenario, definition and the data set released. For each task, a reference algorithm is presented that was used within MediaEval 2010 and comments are included on lessons learned. The Tagging Task, Professional involves automatically matching episodes in a collection of Dutch television with subject labels drawn from the keyword thesaurus used by the archive staff. The Tagging Task, Wild Wild Web involves automatically predicting the tags that are assigned by users to their online videos. Finally, the Placing Task requires automatically assigning geo-coordinates to videos. The specification of each task admits the use of the full range of available information including user-generated metadata, speech recognition transcripts, audio, and visual features.","PeriodicalId":390933,"journal":{"name":"Proceedings of the 1st ACM International Conference on Multimedia Retrieval","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-04-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126431006","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Finding media illustrating events","authors":"Xueliang Liu, Raphael Troncy, B. Huet","doi":"10.1145/1991996.1992054","DOIUrl":"https://doi.org/10.1145/1991996.1992054","url":null,"abstract":"We present a method combining semantic inferencing and visual analysis for finding automatically media (photos and videos) illustrating events. We report on experiments validating our heuristic for mining media sharing platforms and large event directories in order to mutually enrich the descriptions of the content they host. Our overall goal is to design a web-based environment that allows users to explore and select events, to inspect associated media, and to discover meaningful, surprising or entertaining connections between events, media and people participating in events. We present a large dataset composed of semantic descriptions of events, photos and videos interlinked with the larger Linked Open Data cloud and we show the benefits of using semantic web technologies for integrating multimedia metadata.","PeriodicalId":390933,"journal":{"name":"Proceedings of the 1st ACM International Conference on Multimedia Retrieval","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-04-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128996346","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Finding locations of flickr resources using language models and similarity search","authors":"O. Laere, S. Schockaert, B. Dhoedt","doi":"10.1145/1991996.1992044","DOIUrl":"https://doi.org/10.1145/1991996.1992044","url":null,"abstract":"We present a two-step approach to estimate where a given photo or video was taken, using only the tags that a user has assigned to it. In the first step, a language modeling approach is adopted to find the area which most likely contains the geographic location of the resource. In the subsequent second step, a precise location is determined within the area that was found to be most plausible. The main idea of this step is to compare the multimedia object under consideration with resources from the training set, for which the exact coordinates are known, and which were taken in that area. Our final estimation is then determined as a function of the coordinates of the most similar among these resources. Experimental results show this two-step approach to improve substantially over either language models or similarity search alone.","PeriodicalId":390933,"journal":{"name":"Proceedings of the 1st ACM International Conference on Multimedia Retrieval","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-04-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121305767","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"3D model retrieval using accurate pose estimation and view-based similarity","authors":"A. Axenopoulos, George C. Litos, P. Daras","doi":"10.1145/1991996.1992037","DOIUrl":"https://doi.org/10.1145/1991996.1992037","url":null,"abstract":"In this paper, a novel framework for 3D object retrieval is presented. The paper focuses on the investigation of an accurate 3D model alignment method, which is achieved by combining two intuitive criteria, the plane reflection symmetry and rectilinearity. After proper positioning in a coordinate system, a set of 2D images (multi-views) are automatically generated from the 3D object, by taking views from uniformly distributed viewpoints. For each image, a set of flip-invariant shape descriptors is extracted. Taking advantage of both the pose estimation of the 3D objects and the flip-invariance property of the extracted descriptors, a new matching scheme for fast computation of 3D object dissimilarity is introduced. Experiments conducted in SHREC 2009 benchmark show the superiority of the pose estimation method over similar approaches, as well as the efficiency of the new matching scheme.","PeriodicalId":390933,"journal":{"name":"Proceedings of the 1st ACM International Conference on Multimedia Retrieval","volume":"86 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-04-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115301778","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PhotoCube: effective and efficient multi-dimensional browsing of personal photo collections","authors":"G. Tómasson, Hlynur Sigurþórsson, B. Jónsson, L. Amsaleg","doi":"10.1145/1991996.1992066","DOIUrl":"https://doi.org/10.1145/1991996.1992066","url":null,"abstract":"It has never been so easy to take pictures, and personal image collections have never been so large. Unfortunately, most current photo browsers provide very limited support for effectively navigating image collections. This demonstration proposal describes PhotoCube, a personal image browser based on a multi-dimensional data model similar to the model used in OLAP applications. With PhotoCube, users can tag pictures, structure tags into various hierarchies, and browse images according to any possible perspective. We also describe three demonstration scenarios that show the power, flexibility and scalability of PhotoCube.","PeriodicalId":390933,"journal":{"name":"Proceedings of the 1st ACM International Conference on Multimedia Retrieval","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-04-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130988170","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automated detection of errors and quality issues in audio-visual content","authors":"Ronny Paduschek, S. Nowak, Uwe Kühhirt","doi":"10.1145/1991996.1992070","DOIUrl":"https://doi.org/10.1145/1991996.1992070","url":null,"abstract":"This paper presents a demonstration of a technology for automated detection of errors and quality issues in audio-visual content. The extensible system, called AVInspector, consists of a signal processing core combined with detection modules for universally applicable audio-visual analysis. The detection modules include algorithms to assess blocking artefacts, picture freezes, clipping, dropouts and noise. The technology can be used for both a stand-alone application and as library for integration within professional products. Typical purposes are the automated observing and monitoring of audio-visual applications. This includes the automated analysis of audio and video material at ingest or within an archive, real-time observations of streaming services and broadcasts as well as the assessment of transcoding results for multi-platform playout.","PeriodicalId":390933,"journal":{"name":"Proceedings of the 1st ACM International Conference on Multimedia Retrieval","volume":"65 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-04-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128745726","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Learning crop regions for content-aware generation of thumbnail images","authors":"L. Kennedy, R. V. Zwol, Nicolas Torzec, Belle L. Tseng","doi":"10.1145/1991996.1992026","DOIUrl":"https://doi.org/10.1145/1991996.1992026","url":null,"abstract":"We propose a model for automatically cropping images based on a diverse set of content and spatial features. We approach this by extracting pixel-level features and aggregating them over possible crop regions. We then learn a regression model to predict the quality of the crop regions, via the degree to which they would overlaps with human-provided crops from these input features. Candidate images can then be cropped based an exhaustive sweep over candidate crop regions, where each region is scored and the highest-scoring region is retained. The system is unique in its ability to incorporate a variety of pixel-level importance cues when arriving at a final cropping recommendation. We test the system on a set of human-cropped images with a large set of features. We find that the system outperforms baseline approaches, particularly when the aspect ratio of the image is very different from the target thumbnail region.","PeriodicalId":390933,"journal":{"name":"Proceedings of the 1st ACM International Conference on Multimedia Retrieval","volume":"30 4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-04-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128186567","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-modal, multi-resource methods for placing Flickr videos on the map","authors":"P. Kelm, S. Schmiedeke, T. Sikora","doi":"10.1145/1991996.1992048","DOIUrl":"https://doi.org/10.1145/1991996.1992048","url":null,"abstract":"We present three approaches for placing videos in Flickr on the world map. The toponym extraction and geo lookup approach makes use of external resources to identify toponyms in the metadata and associate them with geo-coordinates. The metadata-based region model approach uses a k-nearest-neighbour classifier trained over geographical regions. Videos are represented using their metadata in a text space with reduced dimensionality. The visual region model approach uses a support vector machine also trained over geographical regions. Videos are represented using low-level feature vectors from multiple key frames. Voting methods are used to form a single decision for each video. We compare the approaches experimentally, highlighting the importance of using appropriate metadata features and suitable regions as the basis of the region model. The best performance is achieved by the geo-lookup approach used with fallback to the visual region model when the video metadata contains no toponym.","PeriodicalId":390933,"journal":{"name":"Proceedings of the 1st ACM International Conference on Multimedia Retrieval","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-04-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130081929","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}