{"title":"Scene Text Detection and Tracking in Video with Background Cues","authors":"Lan Wang, Yang Wang, Susu Shan, Feng Su","doi":"10.1145/3206025.3206051","DOIUrl":"https://doi.org/10.1145/3206025.3206051","url":null,"abstract":"To detect scene text in the video is valuable to many content-based video applications. In this paper, we present a novel scene text detection and tracking method for videos, which effectively exploits the cues of the background regions of the text. Specifically, we first extract text candidates and potential background regions of text from the video frame. Then, we exploit the spatial, shape and motional correlations between the text and its background region with a bipartite graph model and the random walk algorithm to refine the text candidates for improved accuracy. We also present an effective tracking framework for text in the video, making use of the temporal correlation of text cues across successive frames, which contributes to enhancing both the precision and the recall of the final text detection result. Experiments on public scene text video datasets demonstrate the state-of-the-art performance of the proposed method.","PeriodicalId":224132,"journal":{"name":"Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval","volume":"16 7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123723016","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Temporal Aggregation of Visual Features for Large-Scale Image-to-Video Retrieval","authors":"Noa García","doi":"10.1145/3206025.3206083","DOIUrl":"https://doi.org/10.1145/3206025.3206083","url":null,"abstract":"In this research we study the specific task of image-to-video retrieval, in which static pictures are used to find a specific timestamp or frame within a collection of videos. The inner temporal structure of video data consists of a sequence of highly correlated images or frames, commonly reproduced at rates of 24 to 30 frames per second. To perform large-scale retrieval, it is necessary to reduce the amount of data to be processed by exploiting the redundancy between these highly correlated images. In this work, we explore several techniques to aggregate visual temporal information from video data based on both standard local features and deep learning representations with the focus on the image-to-video retrieval task.","PeriodicalId":224132,"journal":{"name":"Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127044333","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Oral Session 1: Multimedia Retrieval","authors":"Qi Tan","doi":"10.1145/3252926","DOIUrl":"https://doi.org/10.1145/3252926","url":null,"abstract":"","PeriodicalId":224132,"journal":{"name":"Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121683220","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Precise Temporal Action Localization by Evolving Temporal Proposals","authors":"Haonan Qiu, Yingbin Zheng, Hao Ye, Yao Lu, Feng Wang, Liang He","doi":"10.1145/3206025.3206029","DOIUrl":"https://doi.org/10.1145/3206025.3206029","url":null,"abstract":"Locating actions in long untrimmed videos has been a challenging problem in video content analysis. The performances of existing action localization approaches remain unsatisfactory in precisely determining the beginning and the end of an action. Imitating the human perception procedure with observations and refinements, we propose a novel three-phase action localization framework. Our framework is embedded with an Actionness Network to generate initial proposals through frame-wise similarity grouping, and then a Refinement Network to conduct boundary adjustment on these proposals. Finally, the refined proposals are sent to a Localization Network for further fine-grained location regression. The whole process can be deemed as multi-stage refinement using a novel non-local pyramid feature under various temporal granularities. We evaluate our framework on THUMOS14 benchmark and obtain a significant improvement over the state-of-the-arts approaches. Specifically, the performance gain is remarkable under precise localization with high IoU thresholds. Our proposed framework achieves mAP@IoU=0.5 of 34.2%.","PeriodicalId":224132,"journal":{"name":"Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-04-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130300548","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automatic Prediction of Building Age from Photographs","authors":"M. Zeppelzauer, Miroslav Despotovic, Muntaha Sakeena, David Koch, M. Döller","doi":"10.1145/3206025.3206060","DOIUrl":"https://doi.org/10.1145/3206025.3206060","url":null,"abstract":"We present a first method for the automated age estimation of buildings from unconstrained photographs. To this end, we propose a two-stage approach that firstly learns characteristic visual patterns for different building epochs at patch-level and then globally aggregates patch-level age estimates over the building. We compile evaluation datasets from different sources and perform an detailed evaluation of our approach, its sensitivity to parameters, and the capabilities of the employed deep networks to learn characteristic visual age-related patterns. Results show that our approach is able to estimate building age at a surprisingly high level that even outperforms human evaluators and thereby sets a new performance baseline. This work represents a first step towards the automated assessment of building parameters for automated price prediction.","PeriodicalId":224132,"journal":{"name":"Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-04-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122436627","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A LiDAR Point Cloud Generator: from a Virtual World to Autonomous Driving","authors":"Xiangyu Yue, Bichen Wu, S. Seshia, K. Keutzer, A. Sangiovanni-Vincentelli","doi":"10.1145/3206025.3206080","DOIUrl":"https://doi.org/10.1145/3206025.3206080","url":null,"abstract":"3D LiDAR scanners are playing an increasingly important role in autonomous driving as they can generate depth information of the environment. However, creating large 3D LiDAR point cloud datasets with point-level labels requires a significant amount of manual annotation. This jeopardizes the efficient development of supervised deep learning algorithms which are often data-hungry. We present a framework to rapidly create point clouds with accurate point-level labels from a computer game. To our best knowledge, this is the first publication on LiDAR point cloud simulation framework for autonomous driving. The framework supports data collection from both auto-driving scenes and user-configured scenes. Point clouds from auto-driving scenes can be used as training data for deep learning algorithms, while point clouds from user-configured scenes can be used to systematically test the vulnerability of a neural network, and use the falsifying examples to make the neural network more robust through retraining. In addition, the scene images can be captured simultaneously in order for sensor fusion tasks, with a method proposed to do automatic registration between the point clouds and captured scene images. We show a significant improvement in accuracy (+9%) in point cloud segmentation by augmenting the training dataset with the generated synthesized data. Our experiments also show by testing and retraining the network using point clouds from user-configured scenes, the weakness/blind spots of the neural network can be fixed.","PeriodicalId":224132,"journal":{"name":"Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-03-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132328793","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Transductive Zero-Shot Hashing via Coarse-to-Fine Similarity Mining","authors":"Hanjiang Lai, Yan Pan","doi":"10.1145/3206025.3206026","DOIUrl":"https://doi.org/10.1145/3206025.3206026","url":null,"abstract":"Zero-shot Hashing (ZSH) is to learn hashing models for novel/target classes without training data, which is an important and challenging problem. Most existing ZSH approaches exploit transfer learning via an intermediate shared semantic representations between the seen/source classes and novel/target classes. However, the hash functions learned from the source dataset may show poor performance when directly applied to the target classes due to the dataset bias. In this paper, we study the transductive ZSH, i.e., we have unlabeled data for novel classes. We put forward a simple yet efficient joint learning approach via coarse-to-fine similarity mining which transfers knowledges from source data to target data. It mainly consists of two building blocks in the proposed deep architecture: 1) a shared two-streams network to learn the effective common image representations. The first stream operates on the source data and the second stream operates on the unlabeled data. And 2) a coarse-to-fine module to transfer the similarities of the source data to the target data in a greedy fashion. It begins with a coarse search over the unlabeled data to find the images that most dissimilar to the source data, and then detects the similarities among the found images via the fine module. Extensive evaluation results on several benchmark datasets demonstrate that the proposed hashing method achieves significant improvement over the state-of-the-art methods.","PeriodicalId":224132,"journal":{"name":"Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126553175","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Deep Extreme Multi-label Learning","authors":"Wenjie Zhang, Liwei Wang, Junchi Yan, Xiangfeng Wang, H. Zha","doi":"10.1145/3206025.3206030","DOIUrl":"https://doi.org/10.1145/3206025.3206030","url":null,"abstract":"Extreme multi-label learning (XML) or classification has been a practical and important problem since the boom of big data. The main challenge lies in the exponential label space which involves 2L possible label sets especially when the label dimension L is huge, e.g., in millions for Wikipedia labels. This paper is motivated to better explore the label space by originally establishing an explicit label graph. In the meanwhile, deep learning has been widely studied and used in various classification problems including multi-label classification, however it has not been properly introduced to XML, where the label space can be as large as in millions. In this paper, we propose a practical deep embedding method for extreme multi-label classification, which harvests the ideas of non-linear embedding and graph priors-based label space modeling simultaneously. Extensive experiments on public datasets for XML show that our method performs competitive against state-of-the-art result.","PeriodicalId":224132,"journal":{"name":"Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval","volume":"61 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-04-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123946063","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PatternNet: Visual Pattern Mining with Deep Neural Network","authors":"Hongzhi Li, Joseph G. Ellis, Lei Zhang, Shih-Fu Chang","doi":"10.1145/3206025.3206039","DOIUrl":"https://doi.org/10.1145/3206025.3206039","url":null,"abstract":"Visual patterns represent the discernible regularity in the visual world. They capture the essential nature of visual objects or scenes. Understanding and modeling visual patterns is a fundamental problem in visual recognition that has wide ranging applications. In this paper, we study the problem of visual pattern mining and propose a novel deep neural network architecture called PatternNet for discovering these patterns that are both discriminative and representative. The proposed PatternNet leverages the filters in the last convolution layer of a convolutional neural network to find locally consistent visual patches, and by combining these filters we can effectively discover unique visual patterns. In addition, PatternNet can discover visual patterns efficiently without performing expensive image patch sampling, and this advantage provides an order of magnitude speedup compared to most other approaches. We evaluate the proposed PatternNet subjectively by showing randomly selected visual patterns which are discovered by our method and quantitatively by performing image classification with the identified visual patterns and comparing our performance with the current state-of-the-art. We also directly evaluate the quality of the discovered visual patterns by leveraging the identified patterns as proposed objects in an image and compare with other relevant methods. Our proposed network and procedure, PatterNet, is able to outperform competing methods for the tasks described.","PeriodicalId":224132,"journal":{"name":"Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-03-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122012691","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}