Geographic information use in weakly-supervised deep learning for landmark recognition
Yifang Yin, Zhenguang Liu, Roger Zimmermann
2017 IEEE International Conference on Multimedia and Expo (ICME), published 2017-07-10
DOI: 10.1109/ICME.2017.8019376 (https://doi.org/10.1109/ICME.2017.8019376)
Citations: 7
Abstract
Successful deep convolutional neural networks for visual object recognition typically rely on a massive number of training images that have been annotated with class labels or object bounding boxes at great human effort. Here we explore the use of geographic metadata, which are automatically retrieved from sensors such as GPS and compass, in weakly-supervised learning for landmark recognition. The visibility of a landmark in a frame can be calculated from the camera's field-of-view and the landmark's geometric information, such as its location and height. A training dataset is then generated as the union of the frames in which at least one target landmark is present. To reduce the impact of the intrinsic noise in the geo-metadata, we present a frame selection method that removes mislabeled frames in two steps: (1) Gaussian Mixture Model clustering based on camera location, followed by (2) outlier removal based on visual consistency. We compare the classification results obtained from the ground-truth labels with those obtained from the noisy labels derived from the raw geo-metadata. Experiments show that training on the raw geo-metadata achieves a Mean Average Precision (MAP) of 0.797. Moreover, applying our proposed representative frame selection method improves the MAP by a further 6.4%, which indicates the promise of geo-metadata in weakly-supervised learning.
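The visibility test and the weak labeling step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a simplified 2D pie-slice field-of-view model, ignores landmark height and occlusion, and every function name, parameter, and default value (a 60 degree viewable angle, a 500 m visible range) is hypothetical.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius in metres


def distance_and_bearing(cam_lat, cam_lon, lm_lat, lm_lon):
    """Return (distance in metres, bearing in degrees clockwise from north)
    from the camera to the landmark. Uses an equirectangular approximation,
    which is reasonable over the short ranges a camera can see."""
    lat1, lat2 = math.radians(cam_lat), math.radians(lm_lat)
    dlon = math.radians(lm_lon - cam_lon)
    east = dlon * math.cos((lat1 + lat2) / 2.0) * EARTH_RADIUS_M
    north = (lat2 - lat1) * EARTH_RADIUS_M
    distance = math.hypot(east, north)
    bearing = math.degrees(math.atan2(east, north)) % 360.0
    return distance, bearing


def is_landmark_in_fov(cam_lat, cam_lon, heading_deg, lm_lat, lm_lon,
                       view_angle_deg=60.0, max_range_m=500.0):
    """Assumed 2D model: the landmark counts as visible if it lies within
    max_range_m of the camera and within half the viewable angle of the
    compass heading. Height and occlusion are deliberately not modelled."""
    distance, bearing = distance_and_bearing(cam_lat, cam_lon, lm_lat, lm_lon)
    if distance > max_range_m:
        return False
    # Smallest signed angle between bearing and heading, in [-180, 180).
    diff = (bearing - heading_deg + 180.0) % 360.0 - 180.0
    return abs(diff) <= view_angle_deg / 2.0


def weak_labels(frames, landmarks, **fov):
    """Map frame index -> names of landmarks judged visible.
    frames: list of (lat, lon, heading_deg); landmarks: name -> (lat, lon).
    Frames that see no target landmark are left out of the training set."""
    labeled = {}
    for i, (lat, lon, heading) in enumerate(frames):
        visible = [name for name, (l_lat, l_lon) in landmarks.items()
                   if is_landmark_in_fov(lat, lon, heading, l_lat, l_lon, **fov)]
        if visible:
            labeled[i] = visible
    return labeled
```

For example, weak_labels([(1.29, 103.85, 0.0), (1.29, 103.85, 180.0)], {"tower": (1.292, 103.85)}) keeps only the first frame, because the second camera faces away from the landmark. Labels produced this way inherit any GPS and compass error, which is the noise the two-step frame selection in the abstract is meant to reduce.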